Abstract


Automatic Language Identification (ALI) is the problem of automatically identi- 
fying the language of an utterance through the use of a computer. In 1977, House and 
Neuburg proposed an approach to ALI which focused on the phonotactic constraints 
of different languages. Their work suggested that simple language models could be 
used effectively for language identification if an accurate phonetic representation of 
an utterance could be obtained from the acoustic signal. Our research utilizes House 
and Neuburg’s ideas as the starting point for a new segment-based approach to ALI. 

To develop a solid theoretical basis for the design of an ALI system, a formal prob- 
abilistic framework has been developed. This framework uses House and Neuburg’s 
ideas as its foundation but also utilizes additional information that may be useful for 
ALI. Specifically, phonotactic, acoustic and prosodic information are all incorporated 
into the framework which provides the structure for the segment-based system. 

To investigate the capabilities of the new segment-based approach, the system 
was trained and tested using the OGI Multi-Language Telephone Speech Corpus, 
which consists of utterances in 10 different languages. The entire system was able 
to identify the language of a test utterance 48.6% of the time. To investigate the 
system’s performance in more detail, the entire system, as well as each component 
of the system, was evaluated as various test conditions were altered. Overall, the 
analyses of the system confirmed that the phonotactic constraints of languages can be 
used effectively for ALI. However, it was also discovered that additional information,
such as prosodic and acoustic information, can usefully supplement the
phonotactic information.
Chapter 1


Introduction 


1.1 Overview 


Automatic Language Identification (ALI) is the problem of automatically identifying 
the language of a spoken utterance through the use of a computer. Although research
on the ALI problem began over 20 years ago, until recently there have been only a
handful of published studies conducted on the topic. Early interest in the ALI problem 
originated within the intelligence community where automated language identification 
could provide obvious benefits. More recently, with increased activities in the devel- 
opment of multi-lingual speech recognition/understanding systems, interest in ALI 
has spread into the academic and industrial communities as well. Applications such 
as machine translation and multi-lingual information retrieval could benefit greatly if 
effective methods for identifying the language a person is speaking can be developed. 

Figure 1.1 shows how a language identifier would fit into a multi-lingual informa- 
tion retrieval system. For this system the job of the language identifier is to determine 
what language is being used in the incoming utterance so that the utterance can be 
passed to the proper speech recognition/understanding system. Ideally the language 
identifier should achieve a high accuracy rate in identifying the language of spoken 
utterances while also being computationally efficient. However, in reality one must 
consider the tradeoff between accuracy and efficiency. 

Figure 1.1: A multi-lingual system using a language identifier. (The identifier routes each
utterance to the English, Japanese, or Spanish recognition/understanding system.)

Upon initial examination of the language identification problem, one may note
that each language of the world can be distinguished from any other language by its
own unique vocabulary. However, utilizing knowledge about the unique vocabulary
of each language would also require a knowledge of the syntactic and semantic rules
which govern the concatenation of words into spoken utterances. Clearly it would be
possible to develop a nearly flawless ALI system if this information could be successfully
incorporated into a system. For example, this approach to ALI could be handled
by simply developing a speech recognition system for all possible languages. The language
of an utterance would be determined when the recognizer trained for the correct
language is able to produce a viable string of words to match the waveform, while
the recognizers for other languages are unable to decipher the input. However, there
are two main reasons why this type of approach may be impractical. First, extensive 
expert knowledge of multiple languages may require a tremendous effort to collect, 
organize, and incorporate into an ALI system. Second, even if extensive expert knowl- 
edge is available and can be incorporated into a system, it may be computationally 
impractical to use all of this knowledge to identify the language that is being spoken. 
Thus, the goal of ALI research to date has generally been to develop dependable 
language identification methods which do not rely upon higher-level knowledge of the 
languages involved. Additionally, many past ALI studies have concentrated on uti- 
lizing only the information that is directly available from the waveform (i.e., acoustic 
features). 

It appears plausible that accurate ALI may be achieved utilizing only the informa- 
tion that is available from the waveform of a spoken utterance. It has been observed 
that humans often have the ability to identify the language of a spoken utterance 
even when they have no working knowledge of the vocabulary or syntax of that lan- 
guage [28]. As will be discussed in Chapter 2, an investigation into the properties 
of different languages reveals that languages often differ in their phonological and 
prosodic characteristics. These characteristics are evident in the waveform of a spoken
utterance. It is the differences in these characteristics which have motivated all of
the ALI approaches to date, including the research presented in this thesis.

1.2 Previous Work


The most common general approach to the ALI problem has been the use of frame- 
based statistical methods. These methods use acoustic models to identify the language 
of a spoken utterance based on the frame-by-frame statistics of the utterance’s acoustic
features. The studies by Cimarusti and Ives [1], Ives [13], Foil [6], Goodman et al. [10], 
Sugiyama [34], Savic et al. [32] and Zissman [36] are similar in that each used a frame- 
based language identification algorithm which was trained on acoustic features of the 
speech signal in an unsupervised fashion. Thus, none of these studies used any prior 
knowledge of the underlying phonetic or prosodic structure of their data. While the 
specifics of the classification algorithms are different in each case, each algorithm 
was designed to identify the language of an utterance based only on the statistics 
of acoustic features. None of these approaches attempts to model the speech as a 
sequence of linguistic events. 

The earliest published research in ALI in this country was performed by Leonard 
and Doddington [18, 19, 20, 21]. They developed an approach where language identifi- 
cation was performed by identifying sound segments or sequences which are particular 
or common to specific languages. Once a set of useful sound segments was proposed, 
language identification was performed by examining the probability distribution of 
the selected sound segments within a speech utterance. This approach was based on 
the assumption that certain linguistic events occur more frequently in particular lan- 
guages and the observed statistics of these events can provide for accurate language 
identification. 

A similar approach to ALI was proposed by House and Neuburg [11]. Like Leonard 
and Doddington, they believed language identification could be performed by ob- 
serving the statistics of the linguistic events present in a speech utterance. More 
specifically, they believed that languages could be identified based upon the sequen- 
tial constraints of their phonetic elements. Based on this belief they proposed a two 
step approach to ALI. The first step was to transform an utterance into a string of 
phonetic elements. The second step was to identify the language of the utterance by 
examining the statistics of the phonetic sequence. However, they believed that the 
extraction of a detailed phonetic sequence from a spoken utterance of an unknown 
language could not be performed with sufficient reliability and, in fact, might not 
even be necessary. Instead, they proposed an approach where the speech input was 
transformed into a sequence of broad phonetic classes. They believed the automatic 
extraction of the underlying string of broad phonetic classes of a spoken utterance 
could be performed with high reliability, though they did not confirm this hypothesis 
on actual speech data. However, in a feasibility study, they did confirm their belief 
that the statistics of sequences of broad phonetic classes would be sufficient for reliable
language identification given a long enough phonetic sequence. They showed
this empirically by transcribing texts from eight different languages into strings of five
broad phonetic classes and evaluating bigram and trigram models applied to these 
transcribed texts. 
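
The feasibility experiment described above can be illustrated with a small sketch. It is not House and Neuburg's implementation: the two-symbol class alphabet (their study used five classes), the add-one smoothing, the `#` start symbol, and the toy training strings for two unnamed languages are all assumptions made for the example.

```python
import math
from collections import Counter

def train_bigram(sequences, alphabet, smoothing=1.0):
    """Estimate smoothed bigram log-probabilities from strings of class labels."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = "#" + seq  # '#' marks the utterance start (an assumption)
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    contexts = list(alphabet) + ["#"]
    return {
        (prev, cur): math.log(
            (pair_counts[(prev, cur)] + smoothing)
            / (context_counts[prev] + smoothing * len(alphabet))
        )
        for prev in contexts
        for cur in alphabet
    }

def log_likelihood(model, seq):
    """Log-probability of a class string under one language's bigram model."""
    padded = "#" + seq
    return sum(model[(prev, cur)] for prev, cur in zip(padded, padded[1:]))

def identify(models, seq):
    """Pick the language whose bigram model scores the string highest."""
    return max(models, key=lambda lang: log_likelihood(models[lang], seq))

# Hypothetical broad classes: V = vowel, C = consonant.
ALPHABET = "VC"
# Made-up training strings: one language avoids consonant clusters, one allows them.
TRAINING = {
    "strict_cv": ["CVCVCV", "VCVCV", "CVCVVCV"],
    "clustered": ["CCVCC", "CVCCVC", "CCCVCCV"],
}
MODELS = {lang: train_bigram(seqs, ALPHABET) for lang, seqs in TRAINING.items()}
```

Calling `identify(MODELS, "CVCVCVCV")` selects the model trained without consonant clusters, because the other model assigns a low probability to every consonant-to-vowel alternation. The same scoring applies unchanged to trigram models or to a larger class alphabet.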

The results presented by House and Neuburg offer the hope that very simple 
phonetic language models can be powerful tools for language identification. While 
their work solidly showed that simple phonetic language models work exceptionally 
well when the underlying string of broad phonetic classes for an utterance is known 
exactly, they did not prove that these language models could be robust when the 
string of phonetic classes contained errors. However, a few studies that utilize House 
and Neuburg’s basic premise have been conducted. 

The work of Li and Edwards [23] was the first attempt following the general frame- 
work proposed by House and Neuburg to be tested on actual speech data. They de- 
signed a frame-based classifier which labeled each frame of an utterance with a broad 
phonetic class. Using a post-processing smoothing algorithm, they transformed the 
frame-based sequence of phonetic labels into a sequence of segments labeled with 
broad phonetic classes. The language identification was then performed using various 
finite state statistical models on the sequences of broad phonetic classes. Unfortu- 
nately, their study demonstrated that House and Neuburg’s approach was effective 
but not infallible. Their results showed that the use of an imperfect phonetic recog- 
nizer for determining the string of broad phonetic classes clearly hurt the ability of 
the language models to perform highly accurate language identification. 

A study by Muthusamy and Cole [27, 28] also utilized the idea of transforming 
the input speech into a sequence of broad phonetic classes. However, they did not 
limit the language identification process to simply building language models for the 
phonetic class sequence. Instead, they devised an approach where various phonetic 
and prosodic features were extracted from the segments of the phonetically labeled 
utterance. A neural network which was trained using these features was then used to 
perform the language identification. 

Lamel and Gauvain [16] used an approach where a phonetic recognition system 
was trained separately for each language. The training produced language dependent 
phone and language models for each language. The language of a test utterance was 
then determined by applying each language dependent phonetic recognizer to the 
utterance and choosing the specific recognizer which produced the highest normalized 
likelihood score (i.e., the recognizer which was able to produce the closest match 
between the waveform and its own language specific models). Lamel and Gauvain 
only tested their approach on the two language set of English and French. For large 
language sets this approach could become computationally burdensome. 

It is very difficult to determine which of the above approaches to the ALI problem
is the most effective. For the most part, each of the studies mentioned above utilized
a different speech corpus. These corpora varied over many different conditions including
their language sets, bandwidths, channel characteristics, vocabulary constraints,
and test utterance lengths. Without a common set of test conditions, a meaningful
comparison of the results reported in the different studies is not possible. Nevertheless,
a brief summary of the results that have been published is shown in Table 1.1. It
should be mentioned that the Muthusamy and Cole system and the Zissman system
were both tested on the OGI Multi-Language Telephone Speech Corpus [29]. This is
the same corpus that was used for the experiments that are presented in this thesis.

Authors of Study | Year | Number of Languages | Avg. Length of Test Utterances | Reported Accuracy
Li and Edwards | 1980 | | |

Table 1.1: Summary of previous published results


1.3 Thesis Overview


The ultimate goal of ALI research is to develop language identification methods which 
are reliable, computationally efficient, and easily portable to new language sets. How- 
ever, the scope of this thesis is limited to research towards the development of a reli- 
able ALI approach which does not require higher level knowledge of the languages it 
is attempting to identify. The research presented in this thesis does not consider the 
issues of computational efficiency or portability to new languages. In its investigation 
of the ALI problem, the basic goals of this thesis can be summarized as: 


1. Present a formal probabilistic framework describing the ALI problem.
2. Present a new segment-based approach to the ALI problem.
3. Analyze and understand the various modeling decisions, assumptions, and test
conditions which affect the performance of the system.

As the starting point for the development of a new ALI design, a formal probabilistic
framework of the ALI problem has been derived. Unlike the automatic speech
recognition problem, no formal probabilistic framework describing the ALI problem 
has been presented in any of the existing papers on the subject. Such a framework 
is presented in Chapter 2. It utilizes House and Neuburg’s ideas as a foundation and 
provides the structure for the ALI design which is described in this thesis. 

Utilizing the probabilistic framework, a new segment-based approach to the ALI 
problem has been developed. Like Muthusamy and Cole’s system, the new design 
retains the basic ideas of House and Neuburg while also allowing for additional infor- 
mation to be used in the language identification process. The basic architecture of the 
new system is shown in Figure 1.2. In this diagram $\vec{a}$ and $\vec{f}$ represent the acoustic and
fundamental frequency information that is extracted from the waveform, $C$ represents
the string of phones or broad phonetic classes that the phonetic recognizer extracts
from the acoustic information, and $S$ represents the segmentation of the waveform
which matches the phonetic string $C$. In this design the language identifier may use
any information that is available from the acoustic feature vectors, fundamental fre- 
quency contour, phonetic sequence or segmentation. A detailed description of each of 
the components in this system is provided in Chapter 3. An analysis and evaluation 
of the performance of the new system is presented in Chapters 4 and 5. 


Figure 1.2: Proposed ALI Design (block diagram: Preprocessor, Phonetic Recognizer, Language Identifier, with the identified language as output)

Chapter 2


Theory 


2.1 Discriminative Information for ALI 


2.1.1 Overview 


The design of a successful ALI system must begin with an understanding of the char- 
acteristics of spoken language which are most useful for the purpose of language iden- 
tification. An ALI system needs to exploit the primary differences which exist among 
languages while still being robust in the face of speaker, channel and vocabulary vari- 
ability. However, the system also needs to be computationally efficient. Thus, it is 
desirable to discover language discriminating characteristics which are relatively easy 
to extract from the acoustic signal, do not require complex methodologies to model, 
and are relatively free of noise from speaker, channel and vocabulary dependencies. 

As discussed in Chapter 1, it may be possible to develop an ALI system which 
can accurately identify languages based only on information that is directly available 
from the waveform of a spoken utterance. The information that is available in an 
utterance’s waveform can be viewed as belonging to one of two groups, phonologi- 
cal information and prosodic information. The series of spoken sounds (or phones) 
which is present in the spoken utterance contains the phonological information. The 
fundamental frequency, intensity and duration variations that span across the spoken 
utterance contain the prosodic information. While the phonological and prosodic in- 
formation available in the signal may represent some higher level information which 
is useful in determining the semantics of an utterance, knowledge of this higher level 
information may not be needed to identify the language of the utterance. It is hoped 
that adequate language identification can be performed using only the phonological 
and prosodic information of an utterance. 
2.1.2 Phonological Information


The phonological properties of a spoken utterance can vary greatly from language to 
language. There are various phonological factors which help define the distinctiveness 
of a language. Some of these factors include the phone set, the phonotactic constraints 
and the acoustic realizations of particular phones within a language.¹

Because each language uses only a small subset of phones from the set of all 
possible speech sounds which exist, variances can be observed across the phone sets of 
different languages [31]. Thus, a knowledge of the phones used in particular languages 
may be enough to help distinguish one language from another. Even if languages 
contain nearly identical phone sets, the languages may still be distinguishable by the 
probability distribution of the phones across each language. Thus, a phone that is 
commonly used in one language may be used rarely in another. 

Different languages may also have different rules governing how sequences of 
phonemes may be constructed to form higher level linguistic elements such as syllables 
or words. These phonotactic constraints could cause certain phonetic sequences to 
be likely in some languages but unlikely in others. For example, Japanese has strict 
phonotactic constraints which generally prohibit consonants from following conso- 
nants. English, on the other hand, has looser constraints which allow for the possi- 
bility of multiple consonants in succession. 

Significant differences may also exist in the acoustic realization of particular 
phones across different languages. These differences may be caused by cross language 
differences in the articulatory gestures used to produce the phone. For example, the 
phoneme /t/ can be realized by a large set of allophones. It can be realized with 
or without aspiration, with a dental or alveolar closure, and with lips rounded or 
unrounded. The use of each of these allophones varies across languages. Some dif- 
ferences in the acoustic realizations of particular phones across languages may occur 
because of the particular phonotactic constraints present within each language. The 
phonotactic constraints of different languages may cause certain coarticulation effects 
to be possible in one language and not possible in another. Thus, the differences that 
arise in the acoustic realizations of phones may be useful for distinguishing languages. 


¹For clarification, the difference between a phoneme and a phone should be stated. A phoneme is
strictly a linguistic unit. A phone is a particular speech sound. A phone can be viewed as the acoustic
realization of a phoneme. Since higher level linguistic knowledge is not being used in the ALI design 
presented in this thesis, knowledge of the particular phonemes that exist in each language is not as 
important as knowledge of the particular phones that exist in each language. Thus, all references 
to phonetic elements, sequences, etc. that are made within this thesis refer to the phones within an 
utterance and not the phonemes. 
2.1.3 Prosodic Information


The prosodic properties of languages can also vary greatly. Fundamental frequency
(F0), duration and voice intensity are all important elements used within the prosodic
structure of a spoken utterance. The manner in which these elements are incorporated 
into the prosodic structure of an utterance varies across languages. The differences 
across languages can often be observed in the realization of the prosodic features 
which determine the tones or stress contained throughout an utterance. 

In tonal languages such as Chinese, the F0 contour and segment duration are
used in determining the tone attached to a particular phone. Altering the tone for a
particular phone can completely change the meaning of the word to which the phone
belongs. Thus, in a tonal language the F0 and phone duration patterns are strongly
dependent on the types of tones used in that language and their relative probability 
distributions. 

In languages that incorporate the concept of word stress, the intensity, duration, 
and F0 contour of a syllable are all correlated with the inherent stress being placed on
that particular syllable [35]. Different languages use stress in different manners. For
free stress languages, such as English, the stress pattern of words can vary between
words with the same number of syllables. For fixed stress languages, such as Polish,
the stress pattern is dependent only on the number of syllables present in each word.
Thus, two words with the same number of syllables will always have the same stress
pattern [31]. The exact manner in which F0, duration and intensity contribute to the
stress of a syllable may also differ from language to language. For example, the timing
of rises or falls of the F0 contour in relation to the placement of stressed syllables
can vary. Some languages use a rising F0 at the beginning of a stressed syllable while
others use a rising F0 at the end of a stressed syllable [35].

It has also been observed that some languages use the F0 contour to represent even
higher level linguistic information. The F0 contour at the end of an utterance has
been observed to differentiate between declarative statements and yes/no questions
in languages such as English, French, Italian, and Japanese [35, 2]. In some languages
such as English, declarative statements are characterized by a falling F0 contour at
the end of an utterance while yes/no questions are characterized with a rising contour. 
However, other languages have been observed to contain just the opposite, a rising 
contour for declaratives and a falling contour for questions. 

The effect of prepausal lengthening of vowels is another prosodic effect which has 
been observed to differ across languages. Lengthening of the final vowel in a sentence 
is a readily observable characteristic of spoken utterances in English, French, German 
and Italian. However, other languages such as Finnish, Estonian, and Japanese have 
been observed to contain little to no sentence-final lengthening of vowels [35]. 
2.2 Probabilistic Framework


2.2.1 Maximum A Posteriori Probability Approach 
General Derivation 


Before designing any system, it is desirable to develop a strong theoretical framework 
on which the design can be based. For this thesis the framework will be probabilistic 
in nature. To begin, let $L = \{L_1, L_2, \ldots, L_n\}$ represent the language set of n different
languages. When an utterance is presented to the ALI system, the system must use 
the acoustic information to decide which of the n languages in L was spoken. 

Typically, the acoustic information of a spoken utterance is represented as a se- 
quence of feature vectors where each individual vector represents the acoustic in- 
formation for a particular time frame. For this derivation, it will be assumed that 
two specific types of information will be extracted from the waveform for each time 
frame; these are the wide-band spectral information and the voicing information. 
The wide-band spectral information is the most useful information for determining 
the underlying phonetic sequence of a spoken utterance. The voicing information, i.e. 
the F0 contour, is primarily used in describing the prosody of an utterance. Because
of the separate natures of the two types of information, it is useful to represent them
as two separate sequences of vectors. Therefore, let $\vec{a} = \{\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_m\}$ be the
sequence of m vectors which represent the wide-band spectral information of a spoken
utterance and let $\vec{f} = \{\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_m\}$ be the sequence of m vectors which represent
the voicing information of a spoken utterance. To clarify the terminology used in this
thesis, the wide-band spectral information contained in $\vec{a}$ will be referred to as the
acoustic information and the vectors contained in $\vec{a}$ will be referred to as acoustic
feature vectors. The information in $\vec{f}$ will be referred to as the F0 information.

The probability that an utterance was spoken in language $L_i$, given the sequences
$\vec{a}$ and $\vec{f}$, is represented by the expression $\Pr(L_i \mid \vec{a}, \vec{f})$. The maximum a posteriori
probability (MAP) approach to the ALI problem is to choose the language which
is most likely given the acoustic and F0 information. Mathematically this can be
expressed as


$$\text{Choose } L_j \text{ such that } \Pr(L_j \mid \vec{a}, \vec{f}) > \Pr(L_i \mid \vec{a}, \vec{f}) \quad \forall\, i \neq j. \tag{2.1}$$


Viewed as a maximization process the MAP approach can alternatively be expressed 
as 
$$\arg\max_{i} \Pr(L_i \mid \vec{a}, \vec{f}). \tag{2.2}$$


The expression in (2.2) is the most general expression describing the ALI problem 
and should serve as the starting point for any probabilistic approach to ALI. 
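
As a concrete illustration of the MAP rule, the sketch below turns per-language scores into posteriors and picks the maximum. It assumes that a log-likelihood log Pr(a, f | L_i) is already available for each language and applies Bayes' rule to obtain the posterior; the function name, the input dictionaries, and the example values used with it are hypothetical.

```python
import math

def map_language(log_likelihoods, priors):
    """Return (best_language, posteriors) under the MAP rule.

    log_likelihoods: dict language -> log Pr(a, f | L_i), assumed given
    priors:          dict language -> Pr(L_i)
    """
    log_joint = {
        lang: log_likelihoods[lang] + math.log(priors[lang])
        for lang in log_likelihoods
    }
    # Log-sum-exp normalization keeps very small likelihoods from underflowing.
    peak = max(log_joint.values())
    log_evidence = peak + math.log(
        sum(math.exp(value - peak) for value in log_joint.values())
    )
    posteriors = {
        lang: math.exp(value - log_evidence) for lang, value in log_joint.items()
    }
    best = max(posteriors, key=posteriors.get)
    return best, posteriors
```

Because the normalizing term is the same for every language, the chosen language would be identical if the normalization were skipped; it is included only so that the returned values can be read as posterior probabilities.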
Incorporating Linguistic Information


Because each spoken utterance contains an underlying sequence of linguistic events, 
a probabilistic framework which incorporates linguistic information is appropriate. 
To incorporate this information into the framework, let C represent the set of all 
possible linguistic sequences which can represent a spoken utterance. Since phonetic 
elements are the most obvious choice for representing the linguistic sequence, it will 
be assumed that the sequences in C are represented with phonetic elements in the 
derivations that follow. Specifically, each unique phonetic sequence will be of the 
form $C = \{c_1, c_2, \ldots, c_l\}$ where each $c$ is represented with a phonetic element. The
set of elements that can be used in the phonetic sequence can be chosen to be as 
detailed as phones or as general as broad phonetic classes. However, the exact set of 
elements that are to be used in the design is not important for the derivation of the 
probabilistic framework. By incorporating the phonetic sequence into the framework, 
the expression in (2.2) becomes 


$$\arg\max_{i} \sum_{C} \Pr(L_i, C \mid \vec{a}, \vec{f}). \tag{2.3}$$


Proceeding from (2.3), there are two general categories of approaches which can 
be developed, frame-based and segment-based. In a frame-based approach, the prob- 
abilistic framework mandates that a phonetic element be associated with each single 
frame of the acoustic input. In a segment-based approach, the model assumes that 
sets of adjacent frames may underlyingly belong to the same phonetic element. Thus, 
in segment based approaches, only one phonetic element will be associated with each 
segment or set of related adjacent frames. These two different approaches are dis- 
cussed separately below. 
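
The difference between the two representations can be made concrete with a short sketch. It collapses a frame-level label sequence into segment boundaries and one label per segment. This is only an illustration of the data structures involved, not the segmentation procedure used in this thesis, and the class labels are hypothetical.

```python
from itertools import groupby

def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (boundaries, segment_labels).

    For p segments the boundary list has p + 1 entries, each a frame index,
    so segment k spans frames boundaries[k] up to boundaries[k + 1] - 1.
    """
    boundaries = [0]
    segment_labels = []
    for label, run in groupby(frame_labels):
        segment_labels.append(label)
        boundaries.append(boundaries[-1] + len(list(run)))
    return boundaries, segment_labels

# Hypothetical frame-level labels: V = vowel, C = consonant, N = nasal.
FRAMES = list("VVVCCVVVVNN")
BOUNDARIES, SEGMENTS = frames_to_segments(FRAMES)
```

Here eleven frame labels reduce to four segment labels, and the duration of each phonetic element becomes explicit as the difference between adjacent boundaries. In the frame-based sequence the same information is only implicit in the run lengths.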


2.2.2 Frame-Based Approach


The main constraint in defining a frame-based approach is that each element of a 
phonetic sequence is mapped one-to-one with its corresponding acoustic frame. If $\vec{a}$
contains m frames then all allowable phonetic sequences $C$ must contain m elements.

In deriving any probabilistic approach, it is often useful to expand general prob- 
abilistic expressions such as (2.3) into multiple probabilistic terms which are simpler 
to model. To begin, (2.3) can be reworked as 


$$\arg\max_{i} \sum_{C} \frac{\Pr(L_i, C, \vec{a}, \vec{f})}{\Pr(\vec{a}, \vec{f})}. \tag{2.4}$$


Since the denominator in (2.4) is independent of i, it can be removed from the maximization
process to yield


$$\arg\max_{i} \sum_{C} \Pr(L_i, C, \vec{a}, \vec{f}). \tag{2.5}$$


The expression can further be rewritten as 


$$\arg\max_{i} \sum_{C} \Pr(\vec{a} \mid \vec{f}, C, L_i) \Pr(\vec{f} \mid C, L_i) \Pr(C \mid L_i) \Pr(L_i). \tag{2.6}$$
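
The step from (2.5) to (2.6) is a repeated application of the chain rule of probability and involves no modeling assumptions. Written out term by term for a single language and phonetic sequence, it reads:

```latex
\begin{aligned}
\Pr(L_i, C, \vec{a}, \vec{f})
  &= \Pr(\vec{a} \mid \vec{f}, C, L_i)\,\Pr(\vec{f}, C, L_i) \\
  &= \Pr(\vec{a} \mid \vec{f}, C, L_i)\,\Pr(\vec{f} \mid C, L_i)\,\Pr(C, L_i) \\
  &= \Pr(\vec{a} \mid \vec{f}, C, L_i)\,\Pr(\vec{f} \mid C, L_i)\,\Pr(C \mid L_i)\,\Pr(L_i).
\end{aligned}
```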


The transformation of (2.3) into (2.6) is useful because (2.6) is organized in such a 
fashion that the acoustic, F0, and phonetic sequence information can all be modeled
separately. Despite the organization of the expression, it lacks a direct means of 
modeling the durations of the underlying phonetic elements. Because frame-based 
approaches do not incorporate the notion of segments, the information regarding the 
duration of the underlying phonetic elements is not explicitly available but rather is 
embedded within the sequence C. 

It should be noted that the expression in (2.6) can be easily simplified to form 
the probabilistic description of a hidden Markov model (HMM) approach. The HMM 
approach has been widely used for many speech recognition related problems including 
ALI [32, 36]. The HMM approach can be formulated by applying the following 
assumptions: 


1. $\vec{f}$ is independent of $\vec{a}$ and $C$.
2. The frames of $\vec{a}$ are independent.
3. $C$ is a Markovian sequence.

With these assumptions, the HMM approach can be represented by the expression


$$\arg\max_{i} \Pr(L_i) \Pr(\vec{f} \mid L_i) \sum_{C} \prod_{k=1}^{m} \Pr(\vec{a}_k \mid c_k, L_i) \Pr(c_k \mid c_{k-1}, L_i). \tag{2.7}$$
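
The sum over all phonetic sequences in the HMM expression does not have to be enumerated explicitly; the standard forward recursion computes it in time linear in the number of frames. The sketch below shows this for one language at a time and checks it against brute-force enumeration. Several simplifications are assumptions of the example: the observations are discrete symbols rather than acoustic feature vectors, the F0 term is omitted, the initial-state distribution stands in for Pr(c_1 | c_0), and all parameter values are made up for two unnamed languages.

```python
from itertools import product

def forward_sum(obs, states, start, trans, emit):
    """Sum over all state sequences of prod_k emit[c_k][a_k] * trans[c_{k-1}][c_k]."""
    alpha = {c: start[c] * emit[c][obs[0]] for c in states}
    for symbol in obs[1:]:
        alpha = {
            c: emit[c][symbol] * sum(alpha[prev] * trans[prev][c] for prev in states)
            for c in states
        }
    return sum(alpha.values())

def brute_force_sum(obs, states, start, trans, emit):
    """Same quantity by explicit enumeration; exponential cost, for checking only."""
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = start[seq[0]] * emit[seq[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= trans[seq[k - 1]][seq[k]] * emit[seq[k]][obs[k]]
        total += p
    return total

STATES = ("V", "C")
# Hypothetical parameters; the two languages differ only in their
# initial-state and transition probabilities.
LANG_MODELS = {
    "lang_1": {
        "prior": 0.5,
        "start": {"V": 0.4, "C": 0.6},
        "trans": {"V": {"V": 0.2, "C": 0.8}, "C": {"V": 0.9, "C": 0.1}},
        "emit": {"V": {"x": 0.8, "y": 0.2}, "C": {"x": 0.3, "y": 0.7}},
    },
    "lang_2": {
        "prior": 0.5,
        "start": {"V": 0.2, "C": 0.8},
        "trans": {"V": {"V": 0.2, "C": 0.8}, "C": {"V": 0.5, "C": 0.5}},
        "emit": {"V": {"x": 0.8, "y": 0.2}, "C": {"x": 0.3, "y": 0.7}},
    },
}

def identify(obs):
    """Score each language by prior times the forward sum and return the best."""
    scores = {
        lang: m["prior"] * forward_sum(obs, STATES, m["start"], m["trans"], m["emit"])
        for lang, m in LANG_MODELS.items()
    }
    return max(scores, key=scores.get), scores
```

For long utterances the products underflow, so a practical version would work with log-probabilities or rescale `alpha` at each frame; that refinement is left out to keep the correspondence with the expression visible.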


2.2.3 Segment-Based Approach 


For a segment-based approach, the concept of segmentation of the input speech must 
be incorporated into the probabilistic framework. To do this, let S represent the set of 
all possible segmentations of the input speech. In using a segment-based approach, the 
set of phonetic sequences that can belong to a particular segmentation is constrained 
by the assumption that there is a direct one-to-one mapping of phonetic elements 
to segments. To represent a particular segmentation containing p segments, let
$S = \{s_1, s_2, \ldots, s_{p+1}\}$ where each $s$ represents the location of a segment boundary. The
only phonetic sequences which can correspond to $S$ are those with
p phonetic elements. Thus, given $S$ the phonetic sequence can be represented as
$C = \{c_1, c_2, \ldots, c_p\}$. With these new considerations the maximization process in
(2.3) can be expanded as


$$\arg\max_{i} \sum_{S} \sum_{C} \Pr(L_i, S, C \mid \vec{a}, \vec{f}). \tag{2.8}$$


This expression can be rewritten as 


$$\arg\max_{i} \sum_{S} \sum_{C} \Pr(L_i \mid C, S, \vec{a}, \vec{f}) \Pr(C \mid S, \vec{a}, \vec{f}) \Pr(S \mid \vec{a}, \vec{f}). \tag{2.9}$$


Up to (2.9) no assumptions have been made, i.e., (2.9) is exactly equivalent to (2.2). 
However, with the tremendously large set of possible segmentations and phonetic 
sequences that could represent each utterance, it would be impractical to attempt 
to perform the summations in (2.9) over all S and all C. The required computation 
can be greatly reduced if only a subset of the possible segmentations and phonetic 
sequences are used in estimating the probabilities of each candidate language. These 
probabilities can potentially be estimated accurately using only the n-best phonetic 
hypotheses. To take this assumption even further, it may be feasible to assume that 
only the most likely segmentation and phonetic sequence needs to be found in the 
process of identifying the most likely language candidate. In this case, the expression 
in (2.9) can be reduced to 


$$\arg\max_{i, S, C} \Pr(L_i \mid C, S, \vec{a}, \vec{f})\,\Pr(C \mid S, \vec{a}, \vec{f})\,\Pr(S \mid \vec{a}, \vec{f}) \tag{2.10}$$


Additionally, it may be feasible to decouple the search for the most likely seg- 
mentation and phonetic sequence from the search for the most likely language. This 
would assume that the best segmentation and phonetic sequence can be found inde- 
pendent of the language of the utterance. The result of this assumption is that the 
maximization process in (2.10) can be separated into two steps. First, the most likely 
segmentation and phonetic sequence are found using 


$$\arg\max_{S, C} \Pr(C \mid S, \vec{a}, \vec{f})\,\Pr(S \mid \vec{a}, \vec{f}) \tag{2.11}$$


Let the most likely segmentation and phonetic sequence be represented as $\hat{S}$ and $\hat{C}$.
After $\hat{S}$ and $\hat{C}$ are found, the second step is to identify the most likely language using


$$\arg\max_{i} \Pr(L_i \mid \hat{C}, \hat{S}, \vec{a}, \vec{f}) \tag{2.12}$$




As discussed previously with the frame-based approach, it may be useful to expand 
the general expression in (2.12) into multiple terms for the purpose of simplifying the 
modeling of the expression. To begin, it can be reworked as 


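$$\arg\max_{i} \frac{\Pr(L_i, \hat{C}, \hat{S}, \vec{a}, \vec{f})}{\Pr(\hat{C}, \hat{S}, \vec{a}, \vec{f})} \tag{2.13}$$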


In the maximization process, the denominator is constant across all i and can be
removed leaving 


$$\arg\max_{i} \Pr(L_i, \hat{C}, \hat{S}, \vec{a}, \vec{f}) \tag{2.14}$$

This expression can be expanded into 


$$\arg\max_{i} \Pr(\vec{a} \mid \hat{C}, \hat{S}, \vec{f}, L_i)\,\Pr(\hat{S}, \vec{f} \mid \hat{C}, L_i)\,\Pr(\hat{C} \mid L_i)\,\Pr(L_i) \tag{2.15}$$


The four probability expressions in (2.15) are considerably easier to model sepa- 
rately than the single probability expression in (2.12). Additionally, the expression 
is now organized in such a way that prosodic and phonetic information are contained 
in separate terms. In modeling, these terms become known as: 


1. $\Pr(\vec{a} \mid \hat{C}, \hat{S}, \vec{f}, L_i)$ → The acoustic model.
2. $\Pr(\hat{S}, \vec{f} \mid \hat{C}, L_i)$ → The prosodic model.
3. $\Pr(\hat{C} \mid L_i)$ → The language model.
4. $\Pr(L_i)$ → The a priori language probability.


The prosodic model captures the differences that can occur in prosodic structures 
of different languages due to the stress or tone patterns created by variations in the 
phone durations and F0 contour. The phonetic information is divided into two
separate models, the language model and the acoustic model. The language model will
account for the probability distributions of the phonetic elements and the phonotactic 
constraints within each language. The acoustic model will account for the different 
acoustic realizations of the phonetic elements that may occur across languages. Aside 
from modeling concerns, this organization also provides a useful structure for eval- 
uating the relative contributions towards language identification that phonotactic, 
prosodic, and acoustic information provide. 

It should also be noted that while a maximum a posteriori probability approach 
was described in this derivation, the maximum likelihood approach can be achieved 
by simply ignoring the a priori language probability. In effect, this is identical to 
assuming all of the languages in the language set L are equally likely. 
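
The decision rule in (2.15) can be illustrated with a minimal sketch. It assumes that each of the three models already returns a log-probability for every candidate language; the dictionary keys and the function name `identify_language` are hypothetical and chosen only for illustration. Working in the log domain turns the product in (2.15) into a sum, and omitting the priors gives the maximum likelihood variant described above.

```python
import math


def identify_language(scores, priors=None):
    """Return the language with the highest combined log score.

    scores: dict mapping language -> dict of log-probabilities under the
            (hypothetical) keys "acoustic", "prosodic" and "language_model".
    priors: optional dict mapping language -> a priori probability Pr(L_i).
            If omitted, all languages are treated as equally likely, which
            reduces the MAP rule to a maximum likelihood rule.
    """
    best_language, best_total = None, -math.inf
    for language, parts in scores.items():
        # Product of the model terms in (2.15) becomes a sum of logs.
        total = parts["acoustic"] + parts["prosodic"] + parts["language_model"]
        if priors is not None:
            total += math.log(priors[language])
        if total > best_total:
            best_language, best_total = language, total
    return best_language


# Made-up log scores for two candidate languages.
example = {
    "English": {"acoustic": -120.0, "prosodic": -35.0, "language_model": -80.0},
    "German": {"acoustic": -118.0, "prosodic": -36.5, "language_model": -83.0},
}
print(identify_language(example))
```

Keeping the three model scores separate in this way also makes it straightforward to drop one term and observe its contribution in isolation.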




Chapter 3 


System Design 


3.1 System-Wide Decisions 


3.1.1 Overview 


Before making any detailed design and modeling decisions within the ALI system, a 
set of system-wide issues must first be resolved. These key issues can be summarized 
in the following questions: 


• What is the goal of the ALI system?
• What type and amount of data will be used for training and testing?
• What criteria will be used to evaluate the system?


3.1.2 System Goals 


From the outset, the goal of an ALI system must be defined. In particular, it must be 
decided whether the system will perform language verification or language recognition. 
Systems that perform language verification simply verify whether or not an utterance 
is spoken in one particular language. Systems that perform language recognition must 
identify the language of a spoken utterance from a set of language candidates. 

If language recognition is the goal, it must also be decided whether the recognition 
will be performed with an open or closed set of languages. If a closed set of languages 
is used, the system will only be subjected to test utterances which are spoken in 
a language which is present in the system’s training set. However, if an open set 
condition is used, the system may be presented with utterances which are spoken in 
languages which do not appear in its training set. In this case, the system needs to 
be able to reject any utterance which is spoken in a language that is not within its
training set.

The ALI system that is presented in this thesis is a language recognition system 
which operates on a closed set of languages. Thus, during testing, the system is not 
subjected to any languages which are not contained in the training set. We believe 
that it is necessary to first develop a concrete understanding of the issues involved 
in closed set language recognition before attacking the difficult issues involved in 
determining useful rejection criteria for the open set problem. Thus, this thesis only 
concentrates on the problem of reliable closed set language recognition. 


3.1.3 Data Set 


The data set that will be used must also be clearly defined. Because the discriminative 
information that is useful in language identification may vary from language set to 
language set, the set of languages should be clearly defined from the beginning. The 
amount of training data available in each language must also be carefully taken into 
account. The complexity of the models used in the system depends on the amount 
of available training data. Additionally, the system design should also consider the 
constraints placed on the vocabulary and context of the data, as well as the conditions 
under which the data set was recorded. 

For this thesis, the ALI system is evaluated using the OGI Multi-Language Tele- 
phone Speech Corpus [29]. The corpus was collected at the Oregon Graduate Institute 
(OGI).¹ It contains utterances collected over the phone lines, at an 8 kHz sampling
rate, from callers who were native speakers of one of ten different languages. These 
languages are English, German, French, Spanish, Farsi, Tamil, Vietnamese, Mandarin 
Chinese, Korean, and Japanese.² The utterances include fixed vocabulary utterances,
topic-specific utterances, and unconstrained utterances. For each speaker up to ten 
utterances were collected. Four of the ten utterances contained a fixed vocabulary. 
Four others were text-independent but topic-specific. The final two were completely 
unconstrained. The prompts used to elicit the utterances from each speaker are de- 
scribed below along with the time allotted for the speaker’s response to each prompt. 
It should be noted that a usable utterance was not always collected for each prompt. 


¹ While the OGI corpus may eventually be used for many different topics in multi-lingual speech
research, the corpus was originally collected by Yeshwant Muthusamy to aid his research in automatic 
language identification. 

² As a reference, Appendix A contains a breakdown of the language families of each of the ten
languages. Appendix B provides a table of the specific phones which are used in each language. 
The fixed vocabulary utterances were elicited from each speaker with the following
prompts in their native language: 


1. What is your native language? (3 seconds) 
2. What language do you speak most of the time? (3 seconds) 
3. Please recite the seven days of the week. (8 seconds) 


4. Please say the numbers zero through ten. (10 seconds)


The topic-specific utterances were the responses of each speaker to the following 
prompts: 


1. Tell us something that you like about your hometown. (10 seconds) 
2. Tell us about the climate of your hometown. (10 seconds) 
3. Describe the room that you are calling from. (12 seconds) 


4. Describe your most recent meal. (10 seconds)


The unconstrained utterances were collected by asking each speaker to speak freely 
about any topic of their choosing for one minute. Each unconstrained utterance was 
divided into two separate portions for the corpus; one with ten seconds of speech, the 
other with the remaining speech of the utterance. 

The corpus is subdivided into three groups: a training set, a development test set,
and a final test set. The training set contains the utterances from fifty speakers in
each language. The development test set and the final test set each contain twenty
speakers per language. For this thesis, the training set is used for training the ALI
system and the development test set is used for testing. The final test set has been
set aside for future work. Furthermore, only the topic-specific and unconstrained
utterances are utilized. The fixed-vocabulary utterances are not used.

Excluding the fixed-vocabulary utterances, the training set contains 2715 utter- 
ances and the development test set contains 1120 utterances. The utterances are 
roughly evenly distributed amongst the ten languages. The number of utterances per 
speaker varies from 2 to 6. The male to female ratio of the speakers is roughly 7 to 
3. Unfortunately some languages contain over 85 percent male speakers while others 
contain only 60 percent male speakers. It should also be noted that at the time of the 
experiments in this thesis the corpus had not yet been transcribed.³ Without a full
transcription of all of the utterances in the corpus, completely supervised training for 
phonetic recognition is not possible. 


³ As of the writing of this thesis, work is in progress at OGI to phonetically transcribe the
utterances in the corpus. 
3.1.4 System Evaluation


The measures of performance that are used to evaluate the system must also be 
defined. The most obvious measure of performance is the system’s ability to reliably 
identify the language of a spoken utterance. Closely related to this is the system’s 
ability to identify the language family of an utterance. However, aside from reliability 
in language identification, the system may also be evaluated based on other aspects 
such as its computational requirements, required training set size, and portability to 
different language sets. For this thesis, the design of the ALI system only considers the 
system’s ability to perform reliable language identification.⁴ Because reliable methods
for ALI have not yet been developed, we believe it is best to develop insights into 
the primary problem of language recognition before other issues such as computation 
and portability are considered. 

In evaluating the system’s performance, two statistics are commonly used through- 
out this thesis. These statistics are the language identification accuracy and the rank 
order statistic. The language identification accuracy is the percentage of utterances 
in which the system’s top choice language candidate is correct. The rank order statis- 
tic is the average position of the correct language within the ordered list of language 
candidates. The rank order statistic conveys more information about the system’s 
performance than the language identification accuracy and as such is the more preva- 
lent of the two statistics used in this thesis. 
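
The two statistics can be computed as in the following sketch. It assumes the system's output for each test utterance is available as an ordered list of language candidates (best first) together with the correct label; the function name `evaluate` and the sample data are hypothetical.

```python
def evaluate(ranked_lists, truths):
    """Compute language identification accuracy and the rank order statistic.

    ranked_lists: one ordered candidate list per utterance, best first.
    truths: the correct language for each utterance.
    Returns (accuracy in percent, average rank of the correct language).
    """
    # Position of the correct language in each ordered list (1 = top choice).
    ranks = [ranked.index(truth) + 1 for ranked, truth in zip(ranked_lists, truths)]
    accuracy = 100.0 * sum(rank == 1 for rank in ranks) / len(ranks)
    average_rank = sum(ranks) / len(ranks)
    return accuracy, average_rank


# Made-up outputs for four utterances over a three-language set.
ranked = [
    ["English", "German", "French"],
    ["German", "English", "French"],
    ["French", "German", "English"],
    ["German", "French", "English"],
]
truth = ["English", "English", "English", "German"]
print(evaluate(ranked, truth))
```

Unlike the accuracy, the average rank distinguishes a system that places the correct language second from one that places it last.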


3.2 General System Architecture 


For this thesis, the system is structured around the segment-based probabilistic frame- 
work described in Chapter 2. A system which utilizes this segment-based framework 
can be realized as a series of three components. These components are a preprocessor, 
a phonetic recognizer, and a language identifier. The preprocessor receives the raw 
acoustic waveform as its input and transforms this input into the frame-based feature 
vectors a and f. The phonetic recognizer receives the vectors a and f as its input
and finds the best phonetic hypothesis and segmentation, $\hat{C}$ and $\hat{S}$. The language
identifier then uses a, f, $\hat{C}$ and $\hat{S}$ to find the most likely language. This architecture
is displayed in Figure 3.1. 


⁴ Although the system’s design does not consider any issues other than reliable language identi-
fication, evaluations of the system’s training set requirements and receiver-operator characteristics 
are presented in this thesis in Chapter 4. 
[Block diagram: Preprocessor → Phonetic Recognizer → Language Identifier → language]


Figure 3.1: System Architecture


3.3. Preprocessing 


3.3.1 Spectral Representation 


For this thesis, the acoustic vector a is represented with mel-frequency scale cepstral 
coefficients (MFCC’s) [26]. A set of fourteen MFCC’s are computed for each utterance 
with a frame rate of 200 frames per second, a discrete Fourier transform (DFT) size of 
256, and a Hamming window of length 25.6 milliseconds. In addition to the MFCC’s, 
fourteen delta MFCC’s are also computed. The delta MFCC’s are computed with 
the expression 

i= ea =a (3.1) 

2 2 

where $x[i]$ and $\dot{x}[i]$ represent an MFCC value and delta MFCC value for the $i$-th
frame. The MFCC signal representation was chosen because it has proven to be an 
effective representation for speech recognition in various different languages including 
English [25], Italian [5] and Japanese [12]. 


3.3.2 Voicing Information 


For this thesis, the voicing information contained in the vector f is extracted from 
the acoustic signal with the formant program contained in Entropic’s ESPS package. 
The fundamental frequency tracker contained in the formant program is based on an 
algorithm devised by Secrest and Doddington [33]. The frame rate for f is also 200 
frames per second. For each frame, a fundamental frequency (F0) and a probability 
of voicing parameter are estimated. In an attempt to eliminate speaker dependencies,
a two step transformation is applied to the F0 values. First, the logarithm (base 2)
of F0 is taken for all voiced frames (i.e. frames whose voicing probability is greater
than 0.5). Second, in the logarithm domain, the mean F0 value for each utterance
is computed and subtracted from each F0 value. Additionally, a delta F0 value is
calculated (also in the logarithm domain) for each voiced frame in the same fashion
as the delta MFCC values are found from the MFCC’s (see (3.1)).
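
The two step F0 transformation can be sketched as follows. The function name `normalize_f0` and the choice to mark unvoiced frames with `None` are assumptions made for illustration; the delta computation is left out because it depends on the exact form of (3.1).

```python
import math


def normalize_f0(f0_hz, voicing_prob, threshold=0.5):
    """Return per-frame log2 F0 with the utterance mean removed.

    f0_hz: F0 estimate in Hz for each frame.
    voicing_prob: probability of voicing for each frame.
    Frames whose voicing probability does not exceed the threshold are
    treated as unvoiced and returned as None.
    """
    voiced = [p > threshold for p in voicing_prob]
    # Step 1: base-2 logarithm of F0, voiced frames only.
    log_f0 = [math.log2(f) if v else None for f, v in zip(f0_hz, voiced)]
    voiced_values = [x for x in log_f0 if x is not None]
    if not voiced_values:
        return log_f0
    # Step 2: subtract the utterance mean in the logarithm domain.
    mean = sum(voiced_values) / len(voiced_values)
    return [None if x is None else x - mean for x in log_f0]


print(normalize_f0([100.0, 200.0, 0.0, 400.0], [0.9, 0.8, 0.1, 0.7]))
```

Because the mean is removed in the base-2 logarithm domain, the resulting values express pitch in octaves relative to the speaker's average for that utterance.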


3.4 Phonetic Recognition 


3.4.1 Overview 


As previously stated, the fact that the OGI corpus is unlabeled prevents the use of 
a phonetic recognizer which is trained in a fully supervised manner. It is thus neces- 
sary to devise phonetic recognition schemes which do not rely upon fully supervised 
training. Two possible alternatives that are investigated in this thesis are: 


• To train a phonetic recognizer in an unsupervised fashion.
• To train a phonetic recognizer using an alternate database which is labeled.


In developing either of these approaches three main issues must be addressed. 
These issues are: 


• How will the segmentation probability $\Pr(S \mid \vec{a}, \vec{f})$ be modeled?
• How will the phonetic classification probability $\Pr(C \mid S, \vec{a}, \vec{f})$ be modeled?
• What will the set of phonetic units be?


3.4.2 Phonetic Recognition Utilizing Unsupervised Training 
Determining the Best Segmentation 


In segment-based approaches, a model for segmentation must be defined. One ap- 
proach to modeling the probability Pr(S | a, f) is to model the probability of the 
existence of the boundaries which define the segmentation. This approach is used in 
segment-based approaches such as the stochastic explicit-segment modeling approach 
proposed by Leung et al. [22]. Because of the tremendous number of possible seg- 
mentations which can exist, it is desirable to limit the segmentation search space to 
a small subset of likely segmentations. One means of accomplishing this search space 
reduction is to use a hierarchical segmentation algorithm such as the one developed 
by Glass [8, 9]. In Glass’s approach, a dendrogram produced from the spectral in- 
formation of the signal provides a well organized segmentation search space. The 
dendrogram is produced by a hierarchical clustering algorithm which clusters seg- 
ments that are adjacent in time using an acoustic similarity measure. 
For the unsupervised approach, the search for the best segmentation $\hat{S}$ will be
considered independent of the search for the best linguistic sequence $\hat{C}$. Thus the
search for the best segmentation $\hat{S}$ can be represented with the expression


$$\max_{S} \Pr(S \mid \vec{a}, \vec{f}) \tag{3.2}$$


Since the corpus is not labeled the actual segments within the training data are not 
known. This makes it impossible to develop an automatic method for finding the 
single best segmentation which is trained in a supervised manner. Thus, a different 
means for finding the best segmentation must be devised. One possible way to select 
a single segmentation is to set a threshold on the acoustic similarity measure used in 
the dendrogram. This threshold would allow two adjacent segments to be clustered 
together into one segment only if their acoustic similarity exceeds the threshold. The 
final segmentation $\hat{S}$ is the segmentation that exists when none of the adjacent clusters
have an acoustic similarity exceeding the threshold. For this thesis, the threshold 
was selected by examining the segmentation output of training utterances in the OGI 
database, and compromising on a value which limits segment boundary deletions at 
the expense of increased segment boundary insertions. 
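
The thresholding idea can be sketched with a simplified bottom-up merge. This is not Glass's dendrogram algorithm: the similarity measure (negative Euclidean distance between region means), the greedy merge order, and the function name `segment` are all assumptions made for illustration.

```python
import math


def segment(frames, threshold):
    """Greedily merge adjacent regions until no pair exceeds the threshold.

    frames: list of equal-length feature vectors (lists of floats).
    threshold: two neighbouring regions are merged only if their similarity
               exceeds this value.
    Returns the boundary indices [s_1, ..., s_{p+1}] of the p final segments.
    """
    if not frames:
        return []
    # Each region is (start frame, end frame, sum of its feature vectors).
    regions = [(i, i + 1, list(frame)) for i, frame in enumerate(frames)]

    def mean(region):
        start, end, total = region
        return [value / (end - start) for value in total]

    def similarity(left, right):
        # Assumed measure: negative Euclidean distance between region means.
        return -math.dist(mean(left), mean(right))

    while len(regions) > 1:
        scores = [similarity(regions[i], regions[i + 1])
                  for i in range(len(regions) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] <= threshold:
            break  # no adjacent pair is similar enough to merge
        left, right = regions[best], regions[best + 1]
        merged = (left[0], right[1], [a + b for a, b in zip(left[2], right[2])])
        regions[best:best + 2] = [merged]
    return [region[0] for region in regions] + [regions[-1][1]]


frames = [[0.0], [0.1], [0.05], [5.0], [5.1], [9.0]]
print(segment(frames, threshold=-1.0))
```

Lowering the threshold in this sketch permits more merges and therefore removes more boundaries, while raising it leaves more boundaries in place, which mirrors the deletion versus insertion trade-off described above.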


Determining the Set of Phonetic Classes 


For an approach which utilizes unsupervised training, an automatic method for deter- 
mining the set of phonetic elements must be used. One simple means for achieving this 
is to use an unsupervised clustering algorithm. For this thesis, the k-means clustering 
algorithm is used [4]. The algorithm clusters segments extracted from the training 
data based on similarity of their acoustic feature vectors. The segment-based acous- 
tic feature vector in this case consists of 14 MFCC values averaged over the length 
of the segment. The entire set of segment-based feature vectors in the training set 
are rotated using principal component analysis. The vectors are then scaled by the
inverse square root of the (now diagonal) covariance matrix of the entire set of vectors.
The rotation and scaling transforms the original vectors into a set of vectors which
contain statistically independent
components where each component has a variance of one. The k-means algorithm 
then utilizes a Euclidean distance metric in its iterative clustering procedure. 
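
The rotation, scaling, and clustering steps can be sketched with NumPy as follows. This is an illustrative sketch rather than the implementation used in the thesis: the function names `whiten` and `kmeans`, the random initialization, and the fixed iteration count are assumptions, and the scaling divides each rotated component by its standard deviation so that every component ends up with unit variance.

```python
import numpy as np


def whiten(vectors):
    """Rotate with PCA, then scale each rotated component to unit variance."""
    centered = vectors - vectors.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    return (centered @ eigenvectors) / np.sqrt(eigenvalues)


def assign(vectors, centroids):
    """Index of the nearest centroid (Euclidean distance) for each vector."""
    distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct training vectors chosen at random.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(vectors, centroids)


# Synthetic stand-in for segment-averaged feature vectors.
rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.1]])
data = rng.normal(size=(300, 3)) @ mixing
codebook, labels = kmeans(whiten(data), k=4)
print(codebook.shape, np.bincount(labels, minlength=4))
```

After whitening, Euclidean distance in the transformed space corresponds to a variance-normalized distance in the original space, which is why a plain Euclidean metric suffices in the clustering step.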

When using the k-means algorithm for this purpose, the hope is that each cluster 
provided by the algorithm will approximately correspond to a specific broad phonetic 
class (i.e. vowel, fricative, nasal, etc.). Figure 3.2 shows the average Mel frequency 
spectral coefficient (MFSC) values for each of the clusters found from the k-means 
algorithm for an experiment where the number of clusters was set to four. As can 
be seen in the figure, the clusters vary predominately in their energy and do not 
have extremely distinctive spectral shapes. Unfortunately, this empirical evidence 
suggests that the clustering algorithm does not provide clusters for the OGI corpus 
which adequately correspond to broad phonetic classes.⁵ Nevertheless, the clustering
algorithm was used to create a series of codebooks where the number of entries in the 
codebooks was varied from 2 to 58. 
Figure 3.2: Average MFSC values for 4 clusters found with the k-means algorithm


Determining the Best Phonetic String 


When the k-means algorithm is used to find a codebook of phonetic units, phonetic 
classification is performed using vector quantization (VQ) [24]. The use of VQ
provides a simple method for modeling the probability $\Pr(C \mid S, \vec{a}, \vec{f})$ which is used in
determining the string $\hat{C}$. For each segment, the VQ algorithm simply chooses the
one codebook entry which most closely matches the acoustic feature vector for that
segment. In essence, this is equivalent to assigning a probability of one to the most
similar codebook entry and a probability of zero to all other codebook entries.
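
A minimal sketch of this hard assignment follows, assuming a Euclidean distance between each segment's feature vector and the codebook entries; the function name `quantize` and the sample values are hypothetical.

```python
import math


def quantize(segment_vectors, codebook):
    """Label each segment with the index of its nearest codebook entry."""
    labels = []
    for vector in segment_vectors:
        distances = [math.dist(vector, entry) for entry in codebook]
        # Hard decision: probability one for the closest entry, zero elsewhere.
        labels.append(distances.index(min(distances)))
    return labels


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
segments = [[0.1, -0.1], [0.9, 1.2], [3.5, 0.2]]
print(quantize(segments, codebook))
```

The resulting list of indices plays the role of the phonetic string passed on to the language identifier.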


⁵ It should be noted that it may be possible to generate codebooks whose entries more closely
resemble broad phonetic classes by using a more sophisticated clustering algorithm than the one 
presented here. An approach which ignores the energy of each segment may also be preferable. 
3.4.3 Phonetic Recognition Utilizing an Alternate Database
The NTIMIT Database 


A second possible alternative to completely supervised training is to train a pho- 
netic recognizer on data from an alternate database. One alternative database that 
could be used is the NTIMIT corpus [14]. NTIMIT contains the utterances from 
the TIMIT corpus passed through a telephone network [17, 43, 44]. The use of the 
NTIMIT corpus for training could cause problems for the phonetic recognizer for two 
reasons. First, the microphones used in collecting the TIMIT data and the phone 
line channel that the data was passed through may be quite different from the tele- 
phone microphones and channels used by the subjects in the OGI corpus. While 
it is possible that the acoustic differences between the NTIMIT data and the OGI 
data could be significant, without phonetic transcriptions for the OGI data, it is not 
possible to quantitatively measure how these differences affect the reliability of the 
phonetic recognizer when it is used on the OGI data. The second problem associated 
with training the recognizer using the NTIMIT data is that the NTIMIT corpus only 
contains utterances collected in English. Because the phones used in English do not 
comprise the full set of phones used across all the languages in the OGI corpus, highly 
accurate phonetic recognition cannot be achieved. However, it is hoped that, despite
the differences between the phone sets of each language, the phonetic labels used in 
NTIMIT can be collapsed into broad phonetic classes that generalize well across all 
languages. If this is the case then an accurate multi-language broad phonetic class 
recognizer may be trained using data from only the English language. 


The SUMMIT Phonetic Recognizer


SUMMIT is a segment-based speech recognition system which was developed by the 
Spoken Language Systems Group at MIT. SUMMIT utilizes Glass’s hierarchical seg- 
mentation algorithm to provide the segmentation search space. An ordered list of 
potential phoneme candidates and their respective likelihoods are produced for each 
potential segment. The phoneme likelihoods are obtained from mixture Gaussian 
density functions for each phoneme which model segment-based feature vectors. A 
search algorithm is applied to the segmentation and phoneme search space to find the 
most likely strings of phonemes. A more detailed description of the SUMMIT system 
is provided in [37], [40] and [41]. 

For this thesis, SUMMIT will be used as the phonetic recognition component of the 
ALI system. To accomplish this, SUMMIT was trained in a fully supervised fashion 
using the NTIMIT corpus. On NTIMIT, SUMMIT achieved a phonetic recognition 
accuracy of 60.5%. Using SUMMIT, the most likely segmentation and string of English
phonemes can be found for each utterance in the OGI corpus. 
Choosing the Set of Phonetic Units


If the SUMMIT system trained on NTIMIT is used, a means for selecting the set of 
phonetic classes that will be used must be determined. For this thesis, the number 
of phonetic classes will be varied to examine the effects of using sets of broad pho- 
netic classes versus sets of more detailed phonetic classes.⁶ Strings of broad phonetic
classes will be less likely to contain errors than strings using more detailed phonetic 
elements. However, strings containing more detailed phonetic elements could provide 
more information if the error rate of the phonetic recognizer is not too overwhelming 
and there is an adequate amount of training data. 

Since the number of phonetic classes used will be varied, a means of determining 
useful phonetic classes as the number of classes is altered must be devised. One po- 
tential means of accomplishing this is to create a phonetic hierarchical structure in 
which the phonemes are clustered according to a similarity measure. If a phonetic lan- 
guage model is to be used in the language identifier, then a useful similarity measure 
might compare the contexts in which specific phonemes or phonetic classes appear. 
By way of example, if two phonemes always appear within similar contexts across 
all languages, then little detail would be lost by combining the phonemes into one 
larger class.⁷ Figure 3.3 shows a hierarchical phonetic clustering which was obtained
by clustering phonemes based upon the contexts in which they appeared in SUMMIT’s 
automatic transcriptions of the training data. Table 3.1 shows the set of phonetic 
classes that can be extracted from the hierarchical clustering when the number of 
classes is set to ten. 

To obtain the hierarchical phonetic structure in Figure 3.3, clustering was per- 
formed in a bottom-up manner. The similarity measure used for the clustering was 
the divergence between the probability distributions of the different phones. In this 
case, the distribution for each phone measured the probability of all of the phone’s 
possible left and right contexts. To describe the divergence measure mathematically,
let $P$ represent the probability distribution for the expression $\Pr(c_l, c_r \mid c)$. Thus, the
distribution $P$ contains a probability for all possible left and right phonetic contexts,
$c_l$ and $c_r$, for the phone $c$. Similarly, let $\tilde{P}$ represent the probability distribution
$\Pr(c_l, c_r \mid \tilde{c})$. Using this notation, the divergence measurement between the
distributions $P$ and $\tilde{P}$ can be expressed as


$$D(P \,\|\, \tilde{P}) = \sum_{c_l, c_r} \left( \Pr(c_l, c_r \mid c) - \Pr(c_l, c_r \mid \tilde{c}) \right) \log \frac{\Pr(c_l, c_r \mid c)}{\Pr(c_l, c_r \mid \tilde{c})} \tag{3.3}$$
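
The divergence between two phones' context distributions can be sketched as follows. The smoothing count `floor`, which keeps every context probability nonzero so that the logarithm is defined, is an assumption, as are the function names and the toy phone strings.

```python
import math
from collections import Counter
from itertools import product


def context_distribution(strings, phone, inventory, floor=0.5):
    """Estimate Pr(c_l, c_r | c) for one phone from labelled phone strings.

    floor is an assumed smoothing count added to every context so that no
    probability is zero.
    """
    counts = Counter()
    for string in strings:
        for left, centre, right in zip(string, string[1:], string[2:]):
            if centre == phone:
                counts[(left, right)] += 1
    contexts = list(product(inventory, repeat=2))
    total = sum(counts[context] + floor for context in contexts)
    return {context: (counts[context] + floor) / total for context in contexts}


def divergence(p, q):
    """Symmetric divergence: sum over contexts of (p - q) * log(p / q)."""
    return sum((p[c] - q[c]) * math.log(p[c] / q[c]) for c in p)


# Toy phone strings over a three-symbol inventory.
inventory = ["a", "t", "s"]
strings = [list("atasatas"), list("tatsatta")]
p_t = context_distribution(strings, "t", inventory)
p_s = context_distribution(strings, "s", inventory)
print(divergence(p_t, p_s))
```

The measure is symmetric and equals zero only when the two distributions are identical, so it can serve directly as the merge criterion in a bottom-up clustering in which the pair of classes with the smallest divergence is combined at each step.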


⁶ The number of classes that can be used has an upper limit of 59. This is the number of distinct
phonetic labels that are used by SUMMIT.

⁷ Other similarity measures might also prove useful. One possible alternative measure is the
acoustic similarity between phones.
Figure 3.3: Hierarchical clustering of NTIMIT phones into broad phonetic classes
English phonemes in class
Lora 5 
PIrPxeavyAMwWoasxre 


pe do 2 gf ka 
ofO6sz45 mmv 
tdpbkg 


Table 3.1: Set of ten automatically selected phonetic classes 


|b Oe pee 
| 2 |[tdpbkghhmmnnnnprsz0dfve? 


| 56 [wwoorovalls 


Table 3.2: Set of seven manually selected phonetic classes 


As can be seen in Figure 3.3 and Table 3.1, the automatic clustering algorithm 
roughly clusters the phonemes into generic broad phonetic clusters. However, due to 
sparse data for a few of the phones, such as /a/, /n/, and /U/, some of the clusters 
produced by the hierarchical clustering algorithm are contrary to intuition. Therefore, 
a number of sets of manually selected broad phonetic classes were also created. These 
sets of phonetic classes were chosen so that the elements of each class were similar 
first in their manner of articulation (i.e. vowel, consonant, closure, etc.) and second 
in their place of articulation (i.e. back, front, etc.). The two sets which proved the 
most effective in language identification experiments are shown in Tables 3.2 and 3.3. 
English phonemes in class

ea 0 ee 

Emr 

| 6 fey 

ce 

as 
h 


| 13 [mm 
———E——————e 
| 6 fed 
| 20 [ra 
ee 
p22 ttt 


Table 3.3: Set of 23 manually selected phonetic classes 


3.5 Language Identification 


3.5.1 Issues 


Using the framework discussed in Chapter 2, the language identification component 
of the system models the expression 


$$\arg\max_{i} \Pr(\vec{a} \mid \hat{C}, \hat{S}, \vec{f}, L_i)\,\Pr(\hat{S}, \vec{f} \mid \hat{C}, L_i)\,\Pr(\hat{C} \mid L_i)\,\Pr(L_i) \tag{3.4}$$


Thus, the modeling issues involved in the language identification component of the 
system can be summarized with the following questions: 


• How will the a priori language probability, $\Pr(L_i)$, be modeled?
• What language model will be used to represent $\Pr(\hat{C} \mid L_i)$?
• What prosodic model will be used to represent $\Pr(\hat{S}, \vec{f} \mid \hat{C}, L_i)$?
• What acoustic model will be used to represent $\Pr(\vec{a} \mid \hat{C}, \hat{S}, \vec{f}, L_i)$?


3.5.2 A Priori Language Probability


The a priori language probability, $\Pr(L_i)$, is perhaps the simplest element in the
system to model. The only concern is determining how the language probabilities should
be estimated. One potential solution is to say that all of the candidate languages are 
equally likely to be spoken. In effect this is the assumption that is made for the 
maximum likelihood approach to ALI. A different approach would be to attempt to 
estimate the probability of encountering each candidate language in the environment 
where the ALI system is to be used. Because the OGI corpus contains nearly equal 
amounts of data from each language, we assume that each candidate language is 
equally likely to be spoken. With this assumption the term $\Pr(L_i)$ can simply be
ignored in the language identification process. 


3.5.3 Language Model 
Overview 


The language model is used to represent the expression $\Pr(\hat{C} \mid L_i)$. The language
model is potentially the most important element of the system. As House and 
Neuburg showed, simple language models applied to error free sequences of broad 
phonetic classes can reliably identify the language of an utterance. In this thesis, an 
n-gram language model is investigated. More specifically, the unigram, bigram, and 
trigram model are examined independently, as well as in combination. 
Basic n-gram Modeling


Some simple assumptions are made in the derivation of the n-gram model. For a 
unigram model each phonetic element is assumed to be statistically independent of 
all other phonetic elements. This can be expressed mathematically as 


$$\Pr(C \mid L_i) = \Pr(c_1, c_2, \ldots, c_p \mid L_i) = \prod_{k=1}^{p} \Pr(c_k \mid L_i) \tag{3.5}$$


A bigram model assumes each phonetic element is statistically dependent on only the 
phonetic element immediately preceding it. This is expressed mathematically as 


$$\Pr(C \mid L_i) = \Pr(c_1 \mid L_i) \prod_{k=2}^{p} \Pr(c_k \mid c_{k-1}, L_i) \tag{3.6}$$


Similarly, a trigram model assumes each linguistic element is statistically dependent 
on the two preceding phonetic elements. This is expressed as 


$$\Pr(C \mid L_i) = \Pr(c_1 \mid L_i)\,\Pr(c_2 \mid c_1, L_i) \prod_{k=3}^{p} \Pr(c_k \mid c_{k-1}, c_{k-2}, L_i) \tag{3.7}$$


To utilize these models for language identification, the probabilities for each lan- 
guage dependent n-gram model are estimated from histogram counts for each pho- 
netic element. The histograms are generated from the phonetic labels attached to 
the training utterances by the phonetic recognizer. To avoid the possibility of having 
probabilities of zero within the n-gram models, each histogram is initialized with an 
arbitrarily chosen minimum count floor of $\frac{1}{p}$ where p is the number of phonetic classes.
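
The estimation and scoring of a language dependent bigram model can be sketched as follows. The value passed as `floor`, the use of the unigram histogram for the first element of a string, and the function names are assumptions made for illustration.

```python
import math


def train_bigram(strings, classes, floor):
    """Estimate unigram and bigram probabilities from histogram counts.

    Every histogram bin is initialized to `floor` so that no probability
    in the resulting model is zero.
    """
    unigram = {c: floor for c in classes}
    bigram = {prev: {cur: floor for cur in classes} for prev in classes}
    for string in strings:
        for c in string:
            unigram[c] += 1
        for prev, cur in zip(string, string[1:]):
            bigram[prev][cur] += 1
    unigram_total = sum(unigram.values())
    unigram = {c: n / unigram_total for c, n in unigram.items()}
    for prev, row in bigram.items():
        row_total = sum(row.values())
        bigram[prev] = {cur: n / row_total for cur, n in row.items()}
    return unigram, bigram


def log_score(string, model):
    """Log of the bigram probability of a phonetic string under one model."""
    unigram, bigram = model
    total = math.log(unigram[string[0]])
    for prev, cur in zip(string, string[1:]):
        total += math.log(bigram[prev][cur])
    return total


# Toy training strings over two broad classes, one string per language.
classes = ["V", "C"]
model_a = train_bigram([list("VCVCVC")], classes, floor=0.5)
model_b = train_bigram([list("VVCCVV")], classes, floor=0.5)
test_string = list("CVCV")
print(log_score(test_string, model_a), log_score(test_string, model_b))
```

The language whose model yields the higher log score is the language model's preferred candidate for the test string.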

In evaluating the performance of the basic n-gram model there are four consider- 
ations that must be taken into account. These considerations are summarized in the 
following questions: 


• How accurately does C represent the underlying string of phonetic elements?

• How many phonetic classes are used to represent the elements in C?

• What is the value of n for the n-gram model?

• How much training data is being used?


The performance of the n-gram model is extremely dependent on the four issues 
stated above. 

The language model component of the system attempts to capture the phonotactic 
constraints of each of the languages using the n-gram statistics of C. In order to do 
this properly, it is important that C represent the actual string of phonetic events as
accurately as possible. House and Neuburg showed that the phonotactic constraints 
of languages are so strong that accurate language identification can be performed 
using simple n-gram models even when the string of phonetic events is modeled with 
elements as general as broad phonetic classes. However, the language identification 
capabilities of an n-gram will be degraded when the actual string of phonetic events 
is corrupted with errors. 

The introduction of errors into the phonetic string has one major consequence 
with respect to n-gram modeling. As the phonetic recognition error rate within C 
is increased, the probability distributions within the n-gram models are shifted away 
from their actual distributions towards more uniform distributions. This shifting of 
the n-gram probability distributions towards more uniform distributions decreases 
the language discrimination abilities of the n-gram model. It should also be noted 
that this effect becomes greater as the size of the n-gram model is increased. This 
can be attributed to the fact that more past information must be used as the value 
of n is increased. In other words, when n is greater than one, the n-gram model is 
subjected not only to errors in the current phonetic element but also to errors in the 
previous phonetic elements that are used for the context dependency of the model. 
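
The flattening effect described above can be seen in a small numerical sketch. It assumes a deliberately simple error model that is not taken from this thesis: each label is replaced, with probability `error_rate`, by one of the other labels chosen uniformly. The distribution values and function names are hypothetical.

```python
import math

def corrupt(dist, error_rate):
    """Distribution of observed labels under uniform substitution errors.

    Assumption: an erroneous label is equally likely to be any other class.
    """
    n = len(dist)
    return [(1 - error_rate) * p + error_rate * (1 - p) / (n - 1) for p in dist]

def entropy(dist):
    """Entropy in bits; a uniform distribution has the maximum entropy."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

true_dist = [0.7, 0.2, 0.1]  # hypothetical unigram distribution for one language
for rate in (0.0, 0.2, 0.4):
    observed = corrupt(true_dist, rate)
    print(rate, [round(p, 3) for p in observed], round(entropy(observed), 3))
```

As the error rate grows, the observed distribution moves toward uniform and its entropy rises, which leaves less contrast between the models of different languages.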

To examine the effect the inventory of phonetic elements used in the phonetic 
string has upon the n-gram model performance, the n-gram model was tested using 
the two different methods for determining C that were described earlier (i.e., the 
SUMMIT phonetic recognizer and the vector quantizer). When the SUMMIT phonetic 
recognizer was used, the string of phonetic classes was determined by collapsing the 
detailed labels produced by SUMMIT into the phonetic classes produced by the hier- 
archical clustering shown in Figure 3.3. Figure 3.4 shows the language identification 
accuracy of the unigram model using the phonetic string output of both the SUM- 
MIT phonetic recognizer and the vector quantizer as the number of phonetic classes 
is varied from 2 to 59. The accuracy is shown for both the training and test sets. 
Figure 3.5 shows the rank order statistic for the same set of experiments. 

As can be seen in Figure 3.4 and Figure 3.5 the unigram model using the SUMMIT 
supplied phonetic string outperforms the unigram model using the vector quantizer’s 
phonetic string as the number of phonetic classes is increased beyond 10. This is 
expected since the SUMMIT recognizer provides a more accurate phonetic represen- 
tation of the utterance than the vector quantizer. However, when the number of 
classes is less than ten, the unigram model performs better with the vector quantizer 
than with SUMMIT. Figures 3.6 and 3.7 show the same experiments using a bigram 
model instead of a unigram model. Similar to the unigram model, the bigram model 
performs better using the SUMMIT recognizer than it does using the vector quantizer 
when the number of phonetic classes is selected to be greater than 7. 

Figure 3.4: Accuracy of unigram model using two different phonetic recognizers as
the number of phonetic classes is varied

Figure 3.5: Rank order statistic of unigram model using two different phonetic
recognizers as the number of phonetic classes is varied

Figure 3.6: Accuracy of bigram model using two different phonetic recognizers as the
number of phonetic classes is varied

Figure 3.7: Rank order statistic of bigram model using two different phonetic
recognizers as the number of phonetic classes is varied

n-gram Model | Determination of Classes | Number of Classes | Language ID Accuracy | Rank Order Statistic
Bigram       | Automatic                | [illegible]       | [illegible]          | [illegible]
Bigram       | Manual                   | 23                | [illegible]          | [illegible]
Trigram      | Automatic                | 7                 | 27.4                 | 3.70
Trigram      | Manual                   | 7                 | 34.8                 | 3.27


Table 3.4: Performance of n-gram models using automatically and manually selected 
broad phonetic classes obtained from the phonetic labels provided by SUMMIT 


To further examine the effect the representation of C has on the language model 
performance, several experiments were also conducted using manually selected broad 
phonetic classes instead of the automatically selected classes produced by the hierar- 
chical clustering. Table 3.4 compares the performance of the best bigram and trigram 
models using the automatically selected phonetic classes with the performance of the 
bigram and trigram models using the manually selected classes shown in Tables 3.2 
and 3.3. As can be observed in Table 3.4, the performance of n-gram models was bet- 
ter using the manually selected classes than the automatically selected classes. These 
results further demonstrate the importance of using meaningful phonetic classes in 
the representation of the phonetic string. 

The amount of available training data is also a primary concern. The size of the n- 
gram model and the number of phonetic classes should be chosen to provide as much 
detail as possible. However, increasing the detail of the modeling also increases the 
amount of data required for proper training. In general, as the number of parameters 
in the n-gram model is increased, the training requirements of the model are also 
increased. If n is the size of the n-gram model and p is the number of phonetic 
classes in the phonetic string, then the n-gram model for each language contains $p^n$
parameters that must be estimated. Thus, as either n or p is increased, the detail in 
the n-gram model is increased, thereby increasing the amount of data that is needed 
for proper training. 
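
To make the growth in parameters concrete, the short sketch below evaluates $p^n$ for a few class inventory sizes that appear in this chapter (7, 29 and 59). The function name is hypothetical.

```python
def ngram_parameters(p, n):
    """Number of probabilities in an n-gram model over p phonetic classes."""
    return p ** n

for p in (7, 29, 59):
    print(p, [ngram_parameters(p, n) for n in (1, 2, 3)])
```

With 59 classes the trigram has 205,379 parameters per language against 3,481 for the bigram and 59 for the unigram.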

The tradeoff between the increased detail and the increased training requirements 
can be observed in Figures 3.8 and 3.9. For small numbers of phonetic classes, the 
trigram model outperforms the bigram and unigram models. This is expected since 
the trigram provides longer distance constraints in modeling the phonetic string than 
the bigram and unigram models. However, as the number of classes is increased the 
performance of the trigram drops off severely due to the lack of sufficient amounts 
of data to properly train the large number of parameters in the trigram model. A 
similar effect is seen between the performance of the bigram and unigram models. 
The bigram model easily outperforms the unigram when the number of classes is less
than 40. However, as the number of classes is increased, the bigram experiences the 
same drop in performance as the trigram model due to insufficient training data. As 
the number of classes is increased above 50, the unigram model begins outperforming 
both the bigram and trigram models, as it has considerably fewer parameters that 
need to be trained. 


Figure 3.8: Language identification accuracy of n-gram models using the SUMMIT
phonetic recognizer with automatically selected classes as the number of phonetic
classes is varied (vertical axis: language identification accuracy in percent;
horizontal axis: number of phonetic classes, 0 to 60)

Figure 3.9: Rank order statistic of n-gram models using the SUMMIT phonetic
recognizer with automatically selected classes as the number of phonetic classes is varied

To examine the training requirements of each of the n-gram models in more detail,
each n-gram model was tested using varying training set sizes. Figures 3.10, 3.11 and 
3.12 show how the performance of the unigram, bigram and trigram models varies 
as the training set size is increased from 10 speakers per language to 50 speakers 
per language. An examination of Figure 3.10 reveals that very little improvement 
in performance is likely to be gained in the unigram model by simply increasing 
the number of training speakers beyond 50. However, significant improvements in 
the bigram and trigram models’ performance may be possible with only a moderate 
increase in the number of training speakers. 

Figure 3.10: Performance of unigram model using the SUMMIT recognizer with
automatically selected classes as the number of training speakers per language is altered

Figure 3.11: Performance of bigram model using the SUMMIT recognizer with
automatically selected classes as the number of training speakers per language is altered

Figure 3.12: Performance of trigram model using the SUMMIT recognizer with
automatically selected classes as the number of training speakers per language is altered

Interpolated n-gram Modeling


One means of reconciling the tradeoff between the advantage of increased detail and 
the disadvantage of limited training data as the number of phonetic classes is increased 
is to utilize an interpolation approach for combining the different n-gram models [15]. 
The idea behind interpolation is to utilize the strength of larger n-gram models when 
sufficient training data is available for specific contexts but to rely more heavily 
on smaller n-gram models when the training data available for a specific context is 
limited. 

The simplest interpolated n-gram model is the interpolated bigram model. The 
interpolated bigram is described by the expression 


$$P(c_i \mid c_{i-1}) = \lambda \Pr(c_i \mid c_{i-1}) + (1 - \lambda) \Pr(c_i). \tag{3.8}$$


In (3.8) the interpolated bigram probability $P(c_i \mid c_{i-1})$ is modeled as a linear
interpolation of the estimated bigram probability, $\Pr(c_i \mid c_{i-1})$, and the estimated
unigram probability, $\Pr(c_i)$. The interpolation factor $\lambda$ is chosen to place more
weight on the estimated bigram when there are enough exemplars of $c_{i-1}$ in the training
data to properly estimate the bigram probability. When there are limited exemplars of
$c_{i-1}$ in the training set, $\lambda$ shifts the weight onto the estimated unigram
probability. The formula for the interpolation factor is

$$\lambda = \frac{k_{c_{i-1}}}{k_{c_{i-1}} + K} \tag{3.9}$$

where $k_{c_{i-1}}$ is the number of exemplars of $c_{i-1}$ in the training set and K is a constant.
Ideally, K should be set to a value which optimizes the performance of the interpolated 
bigram model on the language identification task. 
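
The behavior of the interpolation factor can be sketched in a few lines of Python. The functions follow (3.8) and (3.9) directly; the value of `K`, the probabilities and the context counts below are arbitrary illustrative numbers and the function names are hypothetical.

```python
def interpolation_factor(context_count, K):
    """Interpolation factor of (3.9): count / (count + K)."""
    return context_count / (context_count + K)

def interpolated_bigram(p_bigram, p_unigram, context_count, K):
    """Linear interpolation of (3.8) between bigram and unigram estimates."""
    lam = interpolation_factor(context_count, K)
    return lam * p_bigram + (1.0 - lam) * p_unigram

K = 100.0  # arbitrary value chosen only for this illustration
for count in (0, 10, 100, 1000):
    print(count,
          round(interpolation_factor(count, K), 3),
          round(interpolated_bigram(0.60, 0.10, count, K), 3))
```

A context that never occurs in training falls back entirely to the unigram estimate, a context seen exactly K times weights the two estimates equally, and a frequently seen context approaches the raw bigram estimate.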

The interpolated bigram can be expanded to larger interpolated n-gram models 
in a simple recursive fashion. For example, the interpolated trigram is represented 
with the expression 


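$$P(c_i \mid c_{i-1}, c_{i-2}) = \lambda_2 \Pr(c_i \mid c_{i-1}, c_{i-2}) + (1 - \lambda_2) P(c_i \mid c_{i-1}) \tag{3.10}$$

which expands to

$$P(c_i \mid c_{i-1}, c_{i-2}) = \lambda_2 \Pr(c_i \mid c_{i-1}, c_{i-2}) + (1 - \lambda_2) \bigl(\lambda_1 \Pr(c_i \mid c_{i-1}) + (1 - \lambda_1) \Pr(c_i)\bigr). \tag{3.11}$$
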
Tests on data jackknifed from the training set revealed that an appropriate K 
value for the interpolated bigram is 150. Similarly, a value of 1200 was found to be 
an appropriate K value for the interpolated trigram. Thus $\lambda_1$ can be expressed as

$$\lambda_1 = \frac{k_{c_{i-1}}}{k_{c_{i-1}} + 150} \tag{3.12}$$

and $\lambda_2$ can be expressed as

$$\lambda_2 = \frac{k_{c_{i-1}, c_{i-2}}}{k_{c_{i-1}, c_{i-2}} + 1200} \tag{3.13}$$
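
The recursive construction of the interpolated trigram can be sketched as follows, using the K values of 150 and 1200 reported above. The function name, argument names and the example probabilities and counts are hypothetical.

```python
K_BIGRAM = 150.0    # K value reported for the interpolated bigram
K_TRIGRAM = 1200.0  # K value reported for the interpolated trigram

def interpolated_trigram(p_tri, p_bi, p_uni, bigram_context_count, trigram_context_count):
    """Interpolated trigram probability for one phonetic element.

    bigram_context_count:  training exemplars of the one-element context
    trigram_context_count: training exemplars of the two-element context
    """
    lam1 = bigram_context_count / (bigram_context_count + K_BIGRAM)
    lam2 = trigram_context_count / (trigram_context_count + K_TRIGRAM)
    # Inner step: interpolated bigram built from the bigram and unigram estimates.
    p_interp_bigram = lam1 * p_bi + (1.0 - lam1) * p_uni
    # Outer step: back off from the trigram estimate to the interpolated bigram.
    return lam2 * p_tri + (1.0 - lam2) * p_interp_bigram

print(interpolated_trigram(0.6, 0.4, 0.2, bigram_context_count=150, trigram_context_count=1200))
print(interpolated_trigram(0.6, 0.4, 0.2, bigram_context_count=150, trigram_context_count=0))
```

When the two-element context is unseen in training, the result reduces to the interpolated bigram, so a sparse trigram table cannot drive the estimate on its own.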


Figure 3.13 shows the performance of the interpolated bigram and trigram models 
in comparison to the standard unigram, bigram and trigram models. As can be 
observed in the figure, the interpolated bigram and trigram models outperform the 
standard n-gram models. Additionally, the interpolated models do not experience 
any drop in performance as the number of phonetic classes is increased although 
their performance does level off as the number of phonetic classes is increased beyond 
30. When 59 phonetic classes were used the interpolated trigram achieved a language 
identification accuracy of 41.7% and a rank order statistic of 2.78. It should also be 
noted that the interpolated trigram offers only a slight improvement in performance 
over the interpolated bigram. 


3.5.4 Prosodic Model


Overview 


The prosodic model is used to represent the expression $\Pr(S, f \mid C, L_i)$. Ideally,
this model can be used to capture the differences among languages that exist in the 
prosodic structure of utterances. To accomplish this the model should incorporate 
knowledge about the manner in which word and sentence level stress are incorpo- 
rated into utterances as well as the usage of tones. Unfortunately, while useful and 
reliable methodologies are available for modeling acoustic and phonetic information, 
well-developed techniques for automatically capturing and understanding prosodic 
information are not yet available. Therefore, for these experiments, the prosodic 
model is only used to capture simple statistical information about the fundamental 
frequency and the segment duration information of an utterance. 

To help simplify the modeling, the expression for the prosodic model can be ex- 
panded as follows: 


$$\Pr(S, f \mid C, L_i) = \Pr(f \mid S, C, L_i) \Pr(S \mid C, L_i). \tag{3.14}$$


With this expansion the prosodic model can be expressed as the product of two 
separate models, a fundamental frequency model and a segment duration model. 


Fundamental Frequency Model 


The expression $\Pr(f \mid S, C, L_i)$ can be used to capture the information available in
the F0 contour of an utterance. Although there may be correlation between the F0
contour and the durations of the segments in the utterance, this correlation will be
ignored for these experiments in order to simplify the modeling of the F0 contour.
Thus, f will be considered independent of S and C. With these assumptions the
fundamental frequency model can be simplified as follows:


$$\Pr(f \mid S, C, L_i) = \Pr(f \mid L_i). \tag{3.15}$$


Figure 3.13: Performance of n-gram and interpolated n-gram models as the number
of phonetic classes is varied


While there may be useful information available in the dynamics of the F0 contour,
a method for modeling these dynamics over time for the purpose of language
identification is not yet obvious. Some of this dynamic information is presumably
captured in the delta F0 values contained in f. To simplify the modeling, each frame
will be considered to be statistically independent. With this assumption the F0 model
can be written as


$$\Pr(f \mid L_i) = \prod_{k=1}^{m} \Pr(\vec{f}_k \mid L_i) \tag{3.16}$$


where m is the number of frames in the utterance and $\vec{f}_k$ is a feature vector
representing the F0 and delta F0 values for the $k$th frame. It should be mentioned that
the computation in (3.16) only includes the frames which are voiced.

The expression in (3.16) can be modeled with a mixture of full covariance Gaussian
probability density functions. To create the mixture Gaussian model for each
language, the set of Gaussian density functions within each mixture is initialized
from a set of clusters found with the k-means clustering algorithm. The Gaussians
in each mixture are then iteratively reestimated to maximize the average likelihood
score of the vectors in the training set.

To find the number of Gaussians within each mixture which is sufficient for modeling
the probability density function of the F0 vectors in each language, the performance
of the F0 model was examined as the number of Gaussians in each mixture was
varied from 1 to 24. Figures 3.14 and 3.15 show the performance of the F0 model
as the number of Gaussians per mixture is varied. As can be seen, the performance
of the model levels off as the number of Gaussians per mixture is increased to 9 or
higher. When 9 Gaussians per mixture are used, the F0 model achieves a language
identification accuracy of 23.1% with a rank order statistic of 4.01.
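
The scoring side of the F0 model in (3.16) can be sketched in plain Python. The sketch evaluates a mixture of two-dimensional full covariance Gaussians on (F0, delta F0) vectors and sums the log likelihoods over voiced frames only. Training of the mixtures (k-means initialization and iterative reestimation) is not shown. The function names, the use of `None` to mark unvoiced frames, and all parameter values are assumptions made for this illustration.

```python
import math

def log_gauss_2d(x, mean, cov):
    """Log density of a 2-D full covariance Gaussian."""
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Quadratic form dx^T cov^{-1} dx using the closed-form 2x2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return -0.5 * (2.0 * math.log(2.0 * math.pi) + math.log(det) + quad)

def log_mixture(x, mixture):
    """log sum_j w_j N(x; mean_j, cov_j), computed with log-sum-exp."""
    terms = [math.log(w) + log_gauss_2d(x, m, cv) for w, m, cv in mixture]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def f0_log_likelihood(frames, mixture):
    """Sum of frame log likelihoods; unvoiced frames (None) are skipped."""
    return sum(log_mixture(f, mixture) for f in frames if f is not None)

# Hypothetical two-component mixtures for two languages: (weight, mean, covariance).
mixtures = {
    "lang_a": [(0.5, (120.0, 0.0), ((400.0, 5.0), (5.0, 25.0))),
               (0.5, (200.0, 0.0), ((900.0, 0.0), (0.0, 36.0)))],
    "lang_b": [(0.7, (150.0, 2.0), ((625.0, -8.0), (-8.0, 49.0))),
               (0.3, (240.0, -3.0), ((400.0, 0.0), (0.0, 16.0)))],
}
frames = [(118.0, 1.5), None, (125.0, -2.0), (131.0, 3.0), None]
scores = {lang: f0_log_likelihood(frames, mix) for lang, mix in mixtures.items()}
print(scores)
```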


Segment Duration Model 


The expression $\Pr(S \mid C, L_i)$ can be used to capture the segment duration
information in an utterance. While there may be very useful information contained in S
regarding the stress patterns of the syllables, words and sentences in each utterance,
this information could require fairly complex modeling and as such will be ignored for
these experiments in deference to simplicity. As a simplifying assumption each segment
will be considered independent of all other segments. With this independence
assumption, the segment duration model can be rewritten as


$$\Pr(S \mid C, L_i) = \prod_{k=1}^{m} \Pr(d_k \mid c_k, L_i) \tag{3.17}$$

where m is the number of segments in the utterance and $d_k$ is the duration of the
$k$th segment.

Figure 3.14: Accuracy of F0 model as the number of Gaussians per mixture is varied

Figure 3.15: Rank order statistic of F0 model as the number of Gaussians per mixture
is varied

For these experiments the duration $d_k$ is expressed as the number of frames
contained within the segment. With this consideration, the expression in (3.17) can be
modeled directly with non-parametric probability distributions. A probability dis- 
tribution is created for each phonetic class in each language from histograms which 
count the number of times specific durations occur in the training data. To help 
smooth the tail of each histogram (i.e., the histogram bins corresponding to long seg- 
ment durations), a minimum count floor was applied to each histogram bin and each 
histogram was smoothed with a low pass filter (i.e., Parzen windowing) before being 
used to generate the duration probability distributions. 
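
A minimal sketch of how one such duration distribution could be built is given below. The floor value, the three-point smoothing window, the maximum duration and the toy durations are all assumptions chosen for illustration; the thesis does not state the values it used, and the function name is hypothetical.

```python
def duration_distribution(durations, max_frames, floor, window):
    """Floored, smoothed, normalized histogram of segment durations.

    durations:  observed durations in frames (each at least 1)
    max_frames: longest duration bin; longer segments are clipped to it
    floor:      minimum count applied to every bin before smoothing
    window:     low pass filter coefficients (odd length)
    """
    counts = [floor] * max_frames            # bin i holds duration i + 1
    for d in durations:
        counts[min(d, max_frames) - 1] += 1
    half = len(window) // 2
    smoothed = []
    for i in range(max_frames):
        acc = 0.0
        for j, w in enumerate(window):
            idx = i + j - half
            if 0 <= idx < max_frames:        # ignore taps that fall off the ends
                acc += w * counts[idx]
        smoothed.append(acc)
    total = sum(smoothed)
    return [s / total for s in smoothed]

dist = duration_distribution([3, 4, 4, 5, 5, 5, 6, 12],
                             max_frames=20, floor=0.1, window=(0.25, 0.5, 0.25))
print([round(p, 3) for p in dist])
```

One such distribution would be built for every phonetic class in every language; the floor keeps unseen long durations from receiving zero probability, and the smoothing spreads isolated counts over neighboring bins.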

Figures 3.16 and 3.17 show the performance of the segment duration model on the 
training and test data as the number of phonetic classes is varied from 1 to 59. As can 
be seen in the figures, the model suffers from insufficient training when larger numbers 
of classes are used. The peak performance of the segment duration model occurs when 
29 phonetic classes are used. With 29 phonetic classes, the model achieves a language 
identification accuracy of 27.6% and a rank order statistic of 3.65. 


3.5.5 Acoustic Model 


The expression $\Pr(a \mid f, S, C, L_i)$ is called the acoustic model. This model is used to
capture information about the acoustic realizations of each of the phonetic elements
used in each language. To simplify the modeling, the acoustic information a will
be assumed independent of the fundamental frequency information f. With this
assumption the acoustic model can be rewritten as follows:


$$\Pr(a \mid f, S, C, L_i) = \Pr(a \mid S, C, L_i). \tag{3.18}$$
To further simplify the expression, each segment will be considered independent of
all other segments. With this assumption the acoustic model can be expressed as

$$\Pr(a \mid S, C, L_i) = \prod_{k=1}^{m} \Pr(\vec{a}_k \mid c_k, L_i) \tag{3.19}$$

where m is the number of segments in the utterance, and $\vec{a}_k$ is a segment-based feature
vector describing the acoustics of the $k$th segment.

Figure 3.16: Language identification accuracy of segment duration model as the
number of phonetic classes is varied

Figure 3.17: Rank order statistic of segment duration model as the number of phonetic
classes is varied


Using the above assumptions, continuous probability density functions which 
model the segment-based acoustic feature vectors for each phonetic class in each 
language can be used for the acoustic model. The acoustic feature vectors in this 
case contain the values of each of the 14 MFCC’s and 14 delta MFCC’s averaged over 
the length of each segment. For this thesis, the acoustic feature vectors are modeled 
with mixtures of diagonal Gaussian density functions. To ensure proper amounts of
training data for each mixture of Gaussians, the number of Gaussians used to model 
each phonetic class follows the equation 


$$N_{gaussians} = \begin{cases} N_{max} & \text{if } k/100 > N_{max} \\ [k/100] & \text{otherwise} \end{cases} \tag{3.20}$$


where $N_{gaussians}$ is the number of Gaussians used in the mixture Gaussian model of a
particular phonetic class for a particular language, $N_{max}$ is the maximum number of
Gaussians allowed in each mixture, and $k$ is the number of training vectors for the
phonetic class in that particular language.
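
The rule in (3.20) amounts to allowing roughly one Gaussian per 100 training vectors, capped at the maximum. A small sketch follows. The rounding direction of the bracketed term is not recoverable from the text, so rounding up is an assumption made here so that any class with at least one training vector receives one Gaussian; the function name is hypothetical.

```python
import math

def num_gaussians(num_training_vectors, n_max):
    """Number of Gaussians for one phonetic class in one language.

    Assumption: the bracketed term k/100 is rounded up.
    """
    if num_training_vectors / 100 > n_max:
        return n_max
    return math.ceil(num_training_vectors / 100)

for k in (40, 250, 1600, 5000):
    print(k, num_gaussians(k, n_max=16))
```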

To find an adequate maximum number of Gaussians to use within each mixture, 
the performance of the acoustic model was examined as the maximum number of 
Gaussians was varied from 1 to 28. Figures 3.18 and 3.19 show the performance of
the acoustic model over varying numbers of Gaussians per mixture. As can be seen in 
Figure 3.19, the performance of the acoustic model begins to level off as the maximum 
number of Gaussians is increased beyond 13. Using a maximum of 16 Gaussians per 
mixture, the acoustic model achieves a language identification accuracy of 37.9% with 
a rank order statistic of 3.27. 


3.5.6 System Integration 


To complete the ALI system, each of the individual models must be integrated into 
one system. Since the system is seeking the language which is most likely given 
the acoustic information, the probability scores from each individual model for an 
utterance must be combined to provide one probability score for each language. Using 
the probabilistic framework, this can be accomplished with the following expression: 


$$\max_i \Pr(a \mid C, S, f, L_i) \Pr(S, f \mid C, L_i) \Pr(C \mid L_i). \tag{3.21}$$


To prevent underflow errors in the computation, the logarithm of the expression can 
be taken to yield the following expression: 


$$\max_i \log\bigl(\Pr(a \mid C, S, f, L_i) \Pr(S, f \mid C, L_i) \Pr(C \mid L_i)\bigr). \tag{3.22}$$

The expression in (3.22) can further be expressed as a sum of logarithms yielding


$$\max_i \bigl(\log \Pr(a \mid C, S, f, L_i) + \log \Pr(S, f \mid C, L_i) + \log \Pr(C \mid L_i)\bigr). \tag{3.23}$$


Figure 3.18: Accuracy of acoustic model as the number of Gaussians per mixture is
varied (vertical axis: language identification accuracy in percent; horizontal axis:
number of Gaussians per mixture, 0 to 30)

Figure 3.19: Rank order statistic of acoustic model as the number of Gaussians per
mixture is varied


Model Name             | Model Description
Language Model         | Interpolated trigram models using 59 phonetic classes
Segment Duration Model | Non-parametric probability distributions using 29 phonetic classes
F0 Model               | Mixtures of 9 full covariance Gaussians
Acoustic Model         | Mixtures of 16 diagonal Gaussians using 59 phonetic classes


Table 3.5: Summary of individual models used in final ALI system 


Thus, the log likelihood score for each language can simply be represented as the sum 
of the log likelihood scores for each of the individual models. 
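
The log domain combination in (3.23) can be sketched as follows. The model names match the framework above, but the language names, the scores and the function name are invented for illustration. The last line shows why the logarithm is taken: the direct product of two of the corresponding probabilities already underflows in double precision.

```python
import math

def combine(model_scores):
    """Sum per-model log likelihoods for each language.

    model_scores: {model_name: {language: log_likelihood}}
    """
    languages = list(next(iter(model_scores.values())))
    return {lang: sum(scores[lang] for scores in model_scores.values())
            for lang in languages}

# Hypothetical log likelihood scores for one utterance.
scores = {
    "language": {"english": -410.0, "german": -418.0},
    "acoustic": {"english": -395.0, "german": -392.0},
    "prosodic": {"english": -120.0, "german": -124.0},
}
totals = combine(scores)
best = max(totals, key=totals.get)
print(best, totals)

# The same quantities as raw probabilities cannot be multiplied safely.
underflow = math.exp(-410.0) * math.exp(-395.0)
print(underflow)
```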

Using the log likelihood approach, the language, acoustic, F0, and segment duration
models can easily be integrated into the final system. A summary of the
individual models that are used in the final system is presented in Table 3.5. The 
system uses 59 phonetic classes in the representation of the phonetic string C. How- 
ever, because the segment duration model cannot be sufficiently trained when it uses 
59 classes, the elements of C are collapsed into 29 phonetic classes for the segment 
duration model. 

Unfortunately, when the final system uses the simple addition of log likelihood 
scores with equal weights as described above, the final log likelihood score for each 
language is dominated by the F0 model score. To examine the scores of each model
in more detail, the final log likelihood scores for each model for each utterance can 
be converted into the a posteriori language probabilities. For example, the language 
model can be represented with the following equation: 

$$\Pr(C \mid L_i) = \frac{\Pr(L_i \mid C) \Pr(C)}{\Pr(L_i)} \tag{3.24}$$


From (3.24) the a posteriori language probability can be expressed as


$$\Pr(L_i \mid C) = \frac{\Pr(C \mid L_i) \Pr(L_i)}{\Pr(C)} = k \Pr(C \mid L_i) \Pr(L_i) \tag{3.25}$$

where k is a constant value for each utterance. The value of k can be calculated easily
given the condition that the a posteriori language probabilities over all i must sum to
one. Once k is calculated for a specific utterance, the language model scores for that 
utterance can easily be converted into the a posteriori language probabilities. The a 
posteriori probabilities for any of the other models can be found in the same fashion. 
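
The conversion from log likelihood scores to a posteriori probabilities can be sketched as below. The sketch assumes equal prior probabilities for all languages, which the text does not state, so that the priors are absorbed into the normalizing constant. Subtracting the largest score before exponentiating avoids underflow and does not change the normalized result. The function name, language names and scores are hypothetical.

```python
import math

def posteriors(log_likelihoods):
    """A posteriori language probabilities from per-language log likelihoods.

    Assumption: all languages have equal prior probability.
    """
    top = max(log_likelihoods.values())
    unnormalized = {lang: math.exp(s - top) for lang, s in log_likelihoods.items()}
    norm = sum(unnormalized.values())  # plays the role of 1/k for this utterance
    return {lang: u / norm for lang, u in unnormalized.items()}

post = posteriors({"english": -410.0, "german": -412.0, "spanish": -415.0})
print({lang: round(p, 3) for lang, p in post.items()})
```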

Model Name             | Average A Posteriori Probability of Top Choice | Actual Language ID Accuracy
Language Model         |                                                |
Segment Duration Model |                                                |


Table 3.6: Average a posteriori probability of top choice language vs. actual language 
identification accuracy for each model on data jackknifed from the training set 


The average a posteriori probability of a model’s top choice language provides
a measure of the certainty with which a model believes its top choices are correct.
Therefore, it is expected that a sound model will achieve an actual accuracy which
is approximately equal to the average a posteriori probability of its top choice. The
average a posteriori language probability of the top choice language for each model
was found from utterances which were jackknifed from the training data.⁸ Table 3.6
shows the comparison between the average a posteriori language probability of the
top choice language and the actual language identification accuracy for each model.
As can be seen in the table, the average top choice probability is larger than the actual
language identification accuracy for each of the models. This indicates that the top
choice probabilities are inflated significantly above what they should
be. This may be due to the fact that the assumptions regarding the independence
of segments (or frames) which are made in each of the models allow biases due to
speaker and channel dependencies to accumulate across the length of the utterance.
This effect is most prevalent in the F0 and acoustic models but is also present to a
lesser degree in the language and segment duration models.

To compensate for the discrepancy between the average top choice probability and 
the language identification accuracy of each model, the log likelihood score of each 
model can be multiplied by an artificial scaling factor. Multiplying the log likelihood 
scores of a model by a scaling factor will effectively compress or expand the range of a 
posteriori probabilities for that model. Scaling factors less than one will compress the 
range of the a posteriori probabilities causing the set of probabilities to become more 
uniform. For the four models used in the system, scaling factors which compress the 
range of a posteriori probabilities are appropriate. More specifically, the scaling factor 


⁸ The training data was divided into 5 unique sets of 40 speakers per language for training and
10 speakers per language for jackknifed testing for this experiment.

Log Likelihood


Acoustic Model 


Table 3.7: Log likelihood scaling factors for each model 


for each model is chosen to adjust the top choice average a posteriori probability of 
the model so it is equal to the model’s actual language identification accuracy. The 
scaling factors for each model are shown in Table 3.7. Using the scaling factors shown 
in Table 3.7, the final system was able to achieve a language identification accuracy 
of 48.6% with a rank order statistic of 2.51. 
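
The effect of a scaling factor, and one way such a factor might be found, can be sketched as follows. The criterion (average top choice a posteriori probability equal to the measured accuracy) is the one described above; the bisection search, the function names, the held-out scores and the target accuracy are assumptions made for this illustration and are not taken from the thesis. Equal language priors are also assumed.

```python
import math

def top_posterior(log_likelihoods, scale):
    """A posteriori probability of the top choice after scaling the log scores."""
    scaled = [scale * s for s in log_likelihoods]
    top = max(scaled)
    return 1.0 / sum(math.exp(s - top) for s in scaled)

def average_top_posterior(utterances, scale):
    return sum(top_posterior(u, scale) for u in utterances) / len(utterances)

def fit_scale(utterances, target_accuracy, iterations=60):
    """Bisection on [0, 1]: the average top posterior grows with the scale,
    from 1/(number of languages) at scale 0 to the unscaled value at scale 1."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if average_top_posterior(utterances, mid) > target_accuracy:
            high = mid
        else:
            low = mid
    return 0.5 * (low + high)

# Hypothetical per-language log likelihoods for two held-out utterances.
held_out = [[-10.0, -14.0, -15.0], [-20.0, -19.0, -25.0]]
scale = fit_scale(held_out, target_accuracy=0.6)
print(round(scale, 4), round(average_top_posterior(held_out, scale), 4))
```

A fitted scale below one compresses the a posteriori probabilities toward uniform, which reduces the influence of an overconfident model when the scaled log likelihoods of all models are summed.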

Chapter 4


Analysis 


4.1 Overview 


When evaluating an ALI system it is important to examine the system’s performance 
from a variety of perspectives. Examining a simple statistic such as the system’s over- 
all language identification accuracy or rank order statistic may not provide sufficient 
insight into the various factors which contribute to the system’s performance. It is 
important to understand how the performance is affected as various test conditions 
are altered. It is also important to understand the types of errors that are made and 
the severity of these errors. Some of the important issues that should be examined 
can be summarized in the following questions: 


• Which types of information (i.e., phonetic, acoustic, prosodic) are most useful
for language identification?

• How is the system performance affected by the length of an utterance?

• How is the system performance affected by the vocabulary and speaking style
constraints of an utterance?

• How is the system performance affected by the size of the training set?

• What types of errors does the system make?

• How is the system performance affected by alterations in the language set?

• What are the receiver-operator characteristics of the system?


An examination of these issues is important in determining the strengths and weak- 
nesses of a system. A clear understanding of a system’s advantages and drawbacks 
is necessary if efforts to improve the system’s design and performance are to be suc- 
cessful. 

4.2 Performance of Individual Models


In examining the performance of the ALI system, it is desirable to understand which
information utilized by the system is the most useful. To accomplish this, the
performance of each of the different components of the system can be examined separately
as well as in combination with other components. As a review, Table 4.1 summarizes
the properties of the different models used by the system.

The performance of the ALI system using different combinations of the system’s 
models is shown in Table 4.2. The results throughout the table show that the language 
model is the most important model for language identification. This finding supports 
House and Neuburg’s belief that the phonotactic constraints of languages provide 
information that is useful for language identification. However, the results also show 
that improvements in performance can be gained by using additional information to 
supplement the language model. 

The table shows how each of the other models (i.e., the acoustic, duration and F0
models) performs when used in conjunction with the language model. Despite the
fact that the F0 model is the weakest of all of the models when used on an individual
basis, the F0 model contributes more than the acoustic or duration models when
used in conjunction with the language model. Similarly, when the F0 and duration
models are combined to form the prosodic model, the prosodic model contributes 
more to the overall system than the acoustic model. One possible explanation for this 
behavior is that there may be a larger correlation between the information carried 
in the language and acoustic models than there is between the information of the 
language and prosodic models. As such, the prosodic model may be supplementing 
the language model with more independent information than the acoustic model. 

Additionally, it is interesting to note that the performance of the system using
only the combination of the prosodic and acoustic models is nearly as high as the
performance using only the language model. This further indicates that, while the
phonotactic constraints of languages may be powerful, prosodic and acoustic
information are also useful for language identification.


Model Name             | Model Description
Language Model         | Interpolated trigram models using 59 phonetic classes
Segment Duration Model | Non-parametric probability distributions using 29 phonetic classes
F0 Model               | Mixtures of 9 full covariance Gaussians
Acoustic Model         | Mixtures of 16 diagonal Gaussians using 59 phonetic classes


Table 4.1: Summary of individual models used in final ALI system


                                          Language         Rank
                                          Identification   Order
Set of Models                             Accuracy         Statistic

Language Model
Acoustic Model
Duration Model
F0 Model
Duration + F0 (i.e., Prosodic Model)
Language + F0                                              2.61
Language + Acoustic                                        2.69
Language + Duration                                        22
Language + F0 + Acoustic                                   2.60
Language + F0 + Duration                                   2.57
Language + Acoustic + Duration                             2.63
Language + Prosodic                                        2.57
Language + Acoustic                                        2.69
Prosodic + Acoustic                                        2.86
Complete System                           48.6%


Table 4.2: System performance using varying sets of models


4.3 Performance Over Varying Utterance Lengths 


The performance of the system as the test utterance length is varied is shown in 
Figures 4.1 and 4.2. These plots were obtained by examining the system performance 
using only the first t seconds of each utterance where t was varied from 1 second to
45 seconds. For each value of t only the utterances with a length greater than t are
used. As expected, the system performs better as the utterance length is increased. 
On the unconstrained utterances, the system improved from an accuracy of 33.1% 
using 2 seconds of speech to 47.0% using 10 seconds to 56.8% using 45 seconds. 
Figure 4.3 shows the performance of the individual models over time. Figure 4.3
reveals that the language model’s performance has a larger increase as the utterance
length is increased than any of the other models. In fact, the language model performs
as well as the complete integrated system as the length is increased past 40 seconds.
The F0 and duration models also show significant improvements in their performance
as the utterance length increases, although not as large as the improvement in the
language model. The acoustic model has a considerably smaller improvement in
performance as the length is increased and, in fact, shows no improvement as the
length is increased beyond 15 seconds. This may indicate that little additional
information is gained by the acoustic model from any more than a few observations
of each phone. It is interesting to note that for utterance lengths of 3 seconds or
less, the acoustic model outperforms all of the other models. However, as the
utterance length is increased beyond 20 seconds, both the language and duration
models outperform the acoustic model. These results indicate that a language
identification strategy which reduces the weight placed on the acoustic model score
and increases the weight placed on the language model score as the test utterance
length becomes longer may be more appropriate than the static weighting system
that was used in this thesis.


Figure 4.1: Language identification accuracy over varying test utterance length

Figure 4.2: Rank order statistic over varying test utterance length
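
The length-dependent weighting suggested at the end of this section could be
sketched as follows. This is only an illustration and not the static weighting
used in the thesis: the function names, the linear ramp, the weight range of 0.5
to 1.0, and the fixed prosodic weight are all assumptions. The breakpoints of 3
and 20 seconds are borrowed from the observations above.

```python
def length_dependent_weights(duration_s, short_s=3.0, long_s=20.0):
    """Return (language_weight, acoustic_weight) for an utterance length.

    Hypothetical scheme: below short_s the acoustic score is fully weighted,
    above long_s the language score is fully weighted, with a linear ramp
    in between. All numeric values are illustrative assumptions.
    """
    # Fraction of the way from the short regime to the long regime, in [0, 1].
    frac = (duration_s - short_s) / (long_s - short_s)
    frac = min(1.0, max(0.0, frac))
    language_w = 0.5 + 0.5 * frac   # grows from 0.5 to 1.0
    acoustic_w = 1.0 - 0.5 * frac   # shrinks from 1.0 to 0.5
    return language_w, acoustic_w


def combined_score(log_scores, duration_s):
    """Combine per-model log scores for one candidate language.

    log_scores maps "language", "acoustic" and "prosodic" to log probabilities.
    The prosodic score keeps a fixed weight of 1.0 (an assumption).
    """
    language_w, acoustic_w = length_dependent_weights(duration_s)
    return (language_w * log_scores["language"]
            + acoustic_w * log_scores["acoustic"]
            + log_scores["prosodic"])


if __name__ == "__main__":
    scores = {"language": -10.0, "acoustic": -20.0, "prosodic": -5.0}
    for seconds in (2.0, 11.5, 45.0):
        print(seconds, length_dependent_weights(seconds),
              combined_score(scores, seconds))
```

In practice the ramp and its endpoints would have to be tuned on held-out data
rather than fixed by hand.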


4.4 Performance Using Utterance Constraints 


Figures 4.1 and 4.2 also show the performance of the system using two different 
utterance constraints. The figures show the difference in performance using the topic- 
specific utterances versus the unconstrained utterances. As can be seen, the system 
performed significantly better using the topic-specific utterances. Using nine seconds
of speech from test utterances, the system achieved an accuracy of 53.3% on the
topic-specific utterances while only achieving 44.2% on the unconstrained utterances.
Figures 4.4, 4.5, 4.6, and 4.7 show the performance of each of the individual models
on the topic-specific and unconstrained utterances. The figures show that each of
the models (with the possible exception of the duration model) performs better on the
topic-specific utterances.

Because of the vocabulary constraints of the topic-specific utterances, it is ex- 
pected that the language model component of the system would perform better on 
these utterances than on the unconstrained utterances. However, the figures also re- 
veal that the language identification performance of the acoustic and prosodic models 
was also better using the topic-specific utterances. This may partially be due to the 
fact that some acoustic and prosodic information, such as the stress patterns of spe- 
cific words, is correlated with the vocabulary. However, the acoustics and prosodics 
may also be affected by fundamental differences in the speaking styles used in the 
topic-specific and unconstrained utterances. The topic-specific utterances were all 
spontaneous replies to queries while the unconstrained utterances were not limited in 
any fashion. In fact, the unconstrained utterances contained examples of both spon- 
taneous and read speech, which are known to be different in their prosodic nature [3]. 

Figure 4.3: Performance of individual models over varying test utterance length

Figure 4.4: Language model performance: topic-specific vs. unconstrained utterances

Figure 4.5: Acoustic model performance: topic-specific vs. unconstrained utterances

Figure 4.6: F0 model performance: topic-specific vs. unconstrained utterances

Figure 4.7: Duration model performance: topic-specific vs. unconstrained utterances


4.5 Performance Over Varying Training Set Sizes


Figure 4.8 shows the overall performance of the system as the training set size is
varied. While the performance on the training and test data converges as more
training speakers are used, there is still a large gap between the two even when
the training set size is increased to 50 speakers per language. This indicates
that there is still plenty of room for improvement in the system.


Figure 4.8: System performance over varying training set sizes (vertical axis:
language identification accuracy in percent; curves: complete system on training
data and complete system on test data)


Figures 4.9, 4.10, 4.11 and 4.12 show the performance over varying training set
sizes for each of the individual models. These figures show that the language and
acoustic models have a significant gap between the training and testing performance.
This is most likely due to the fact that the language and acoustic models contain
far more parameters to be trained than the duration and F0 models. There is also a
significant (albeit smaller) gap between the training and test set performance of the
duration model. It is possible that the large gaps between the training and test set
performance in the language, acoustic, and duration models could be decreased if the
error rate of the phonetic recognizer were decreased. The F0 model, on the other hand,
has near convergence of its training and test set performance with a training set of
50 speakers per language. This would indicate that the F0 model could withstand a
significant increase in its complexity without suffering from a lack of training data.

Figure 4.9: Language model performance over varying training set sizes

Figure 4.10: Acoustic model performance over varying training set sizes

Figure 4.11: Duration model performance over varying training set sizes

Figure 4.12: F0 model performance over varying training set sizes


4.6 Analysis of Confusions


The confusion matrix for the complete system is shown in Table 4.3. Several
important observations can be made about the errors present in the confusion matrix.
First, there appears to be a bias in the system towards choosing an Indo-European
language (English, German, French, Spanish or Farsi) as the system’s top choice.
Although only 51% of the test utterances are from an Indo-European language, 61%
of the test utterances are classified by the system as one of the five Indo-European
languages. Table 4.4 shows a breakdown of the confusions which occur between
Indo-European and non-Indo-European languages. As a second observation, there are a
few pairs of languages which have significantly larger than average confusion rates.
In particular, the pairs English-German and German-French appear very confusable.
This is understandable given that these languages are all from the Indo-European
language family. However, Japanese and French, which do not belong to the same
family, are also very confusable.


Input Output Hypothesis 
Utterance Eng Ger Fre Spa Far Tam Vie Man _ Kor Jap 


English 8.7 98.2 11.3 52 26 61 #43 43 £4417 
German 43.2 161 42 51 O08 O08 O8 O00 1.7 
French 7.8 63.5 26 26 00 09 35 3.5 43 


Spanish : 81 90 505 63 54 2.7 O9 2.7 7.2 


Farsi 16.2 99 2.7 468 O09 7.2 2.7 63 3.6 
Tamil : 18 2.7 142 88 513 62 18 44 71 
Vietnamese ‘ 3.7 7.5 56 84 6.5 383 3.7 10.3 6.5 
Mandarin 165 18 18 64 O09 O9 58.7 28 7.3 
Korean ; 64 128 73 83 00 46 3.7 422 64 
Japanese : 98 188 89 45 09 45 54 45 40.2 


Table 4.3: Confusion matrix of complete system (all values are percentages) 
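
The Indo-European versus non-Indo-European breakdown discussed in this section can
be obtained by pooling the rows and columns of a full confusion matrix by language
family. The sketch below shows one way to do this. It assumes the matrix is
available as raw utterance counts, and the function name and the three-language
example counts are hypothetical and are not taken from Table 4.3.

```python
def collapse_confusions(counts, labels, group_of):
    """Pool a confusion matrix of counts into groups of languages.

    counts[i][j] is the number of utterances of labels[i] that were
    hypothesized as labels[j]. group_of maps each label to a group name.
    Returns row-normalized percentages: result[true_group][hyp_group].
    """
    groups = sorted(set(group_of.values()))
    totals = {g: {h: 0 for h in groups} for g in groups}
    for i, true_lang in enumerate(labels):
        for j, hyp_lang in enumerate(labels):
            totals[group_of[true_lang]][group_of[hyp_lang]] += counts[i][j]

    percentages = {}
    for g, row in totals.items():
        n = sum(row.values())  # number of test utterances in this group
        percentages[g] = {h: 100.0 * c / n for h, c in row.items()}
    return percentages


if __name__ == "__main__":
    # Hypothetical counts for a three-language toy problem.
    labels = ["English", "German", "Tamil"]
    counts = [
        [6, 3, 1],   # English utterances
        [4, 5, 1],   # German utterances
        [2, 2, 6],   # Tamil utterances
    ]
    group_of = {"English": "Indo-European",
                "German": "Indo-European",
                "Tamil": "Other"}
    for group, row in collapse_confusions(counts, labels, group_of).items():
        print(group, row)
```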


4.7 Performance Over Varying Language Sets 


Examining the performance of the system on the task of pairwise language
identification may provide a clearer picture of the confusions that can occur amongst
the 10 different languages. Table 4.5 shows the performance of the system when the
language set is limited to a pair of languages. All 45 combinations of language pairs
were tested and are shown in the table. The average performance of each language
within the pairwise tests is also shown. For example, in the context of English-Lj,
where Lj can be any particular language other than English, the system had an average
performance of 78.7% accuracy across all Lj.


Input               Output Hypothesis
Utterance           Indo-European    Other

Indo-European           85.1          14.9
Other                   36.2          63.8


Table 4.4: Confusion matrix of Indo-European vs. non-Indo-European languages

The language with the highest average performance in the pairwise tests was Tamil,
which achieved an average accuracy of 88.4% in the Tamil-Lj pairwise tests. The
best pairwise performance of the entire system was 92.6% for the Tamil-German pair.
These results indicate that Tamil is the language which is most dissimilar from the
rest of the languages based on the information used in the ALI modeling.
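
The per-language averages quoted above are simple means over the pairwise tests
that involve a given language. A minimal sketch of that bookkeeping is shown
below. The function name and the three-language accuracies are hypothetical and
are not the values of Table 4.5.

```python
from itertools import combinations


def per_language_average(pair_accuracy, languages):
    """Average the pairwise accuracies of every pair containing each language.

    pair_accuracy maps frozenset({lang_a, lang_b}) to an accuracy in percent.
    """
    averages = {}
    for lang in languages:
        values = [acc for pair, acc in pair_accuracy.items() if lang in pair]
        averages[lang] = sum(values) / len(values)
    return averages


if __name__ == "__main__":
    # Ten languages give 10 * 9 / 2 = 45 unordered pairs.
    print(len(list(combinations(range(10), 2))))

    # Hypothetical accuracies for a three-language toy problem.
    languages = ["A", "B", "C"]
    pair_accuracy = {
        frozenset({"A", "B"}): 60.0,
        frozenset({"A", "C"}): 90.0,
        frozenset({"B", "C"}): 80.0,
    }
    print(per_language_average(pair_accuracy, languages))
    # Average over all pairs, analogous to the overall figure in Table 4.5.
    print(sum(pair_accuracy.values()) / len(pair_accuracy))
```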

As might be expected, the language pair with the poorest pairwise performance 
was the English-German pair with an accuracy of only 63.5%. The French-German 
pair was also highly confusable with an accuracy of only 75.5%. In fact, the aver- 
age performance across all pairs of the four European languages (English, German, 
French and Spanish) was only 75.7% which is considerably lower than the average 
of 83.2% across all language pairs. The system also experienced low pairwise perfor- 
mances with the Japanese-French pair (76.6%) and the Korean-English pair (74.1%). 
The confusions between these pairs are difficult to explain since both pairs contain 
languages from different language families. 

To further demonstrate the importance of the particular set of languages on the 
performance of the system, several experiments were conducted using three different 
sets of 5 languages. Table 4.6 shows the confusion matrix and performance of the 
system using the five Indo-European languages. Table 4.7 shows the confusion matrix 
and performance of the system using the five non-Indo-European languages. Table 4.8 
shows the confusion matrix and performance of the system using a set of five diverse 
languages. As can be seen, the system performs much better on the two sets which 
contain languages from different language families than it does on the set containing 
languages that are all from the Indo-European family. 

To attempt to extract any hidden structure that may be present in the pairwise 
performance matrix in Table 4.5, hierarchical clustering can be performed using the 
separate rows of the matrix. By filling in a value of 50 for all of the diagonal elements 

             Eng Ger Fre Spa Far Tam Vie Man Kor Jap
63.5 77.4 77.9 

75.5 79.5 

80.5 


75.0 - 
79.5 80.5 - 
77.7 87.6 86.9 


92.6 92.5 78.6 
91.1 87.4 83.9 
85.4 90.6 87.7 
88.1 82.6 83.6 
84.3 76.6 78.9 
Average Performance Across All Language Pairs: 83.2 


Table 4.5: Performance of system on pairs of languages (all values are language 
identification accuracies in percentages) 


Input                 Output Hypothesis
Utterance     Eng    Ger    Fre    Spa    Far

English       52.2   15.7    7.0   15.7    9.6
German        27.1   44.9   16.1    6.8    5.1
French        13.0   11.3   67.0    5.2    3.5
Spanish        8.1   15.3   12.6   55.9    8.1
Farsi          3.6   24.3    9.9    4.5   57.7


Overall System Accuracy: 55.4


Table 4.6: Confusion matrix and performance of system using the 5 Indo-European
languages

Input Output Hypothesis
Utterance Tam Vie Man Kor Jap 


Tamil (0.8 6.2. 3.3 88 8.8 


Vietnamese} 8.4 523 84 17.8 13.1 


Mandarin 28 13 ~ 761 “Ga 11:9 
Korean O93" “Ga. 119-6323." 14 
Japanese 3. 45 15.2 8.0 68.8 


Overall System Accuracy: 66.4 


Table 4.7: Confusion matrix and performance of system using the 5 non-Indo- 
European languages 


Input                 Output Hypothesis
Utterance     Eng    Fre    Far    Tam    Man

English       61.7   12.2   11.3    3.5   11.3
French        20.0   67.8    5.2    0.9    6.1
Farsi          9.0   11.7   63.1    0.9   15.3
Tamil          8.0    4.4   10.6   72.6    4.4
Mandarin       6.4    2.8   11.0    2.8   77.1


Overall System Accuracy: 68.2


Table 4.8: Confusion matrix and performance of system using 5 diverse languages

of the matrix, each row can be viewed as a vector.¹ Figure 4.13 shows the results
of the hierarchical clustering using the Euclidean distance as the similarity measure
for the vectors. As might be expected, the four European languages are all clustered 
together in one branch with the Germanic and Romance languages in separate sub- 
branches. Also clustered together in a separate branch of the tree are the only two 
tonal languages in the set, Mandarin and Vietnamese. However, the tree also contains 
the unexpected clusterings of Japanese with French and Korean with Farsi. These 
results offer confirmation that the system is capturing at least some of the fundamental 
differences which occur among languages and language families. 
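
The clustering procedure described above can be sketched in a few lines. The
thesis does not state which linkage criterion was used, so average linkage is
assumed here. The function names and the four-language matrix (with 50 on the
diagonal) are hypothetical and are not the values of Table 4.5.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage_clustering(vectors):
    """Agglomerative clustering of named row vectors.

    vectors maps a language name to its row of the pairwise matrix.
    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of language names.
    """
    clusters = {(name,): [vec] for name, vec in vectors.items()}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # Average linkage: mean distance over all cross-cluster pairs.
                total = sum(euclidean(u, v)
                            for u in clusters[a] for v in clusters[b])
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((a, b, dist))
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    return merges


if __name__ == "__main__":
    # Hypothetical pairwise accuracies; the diagonal is filled with 50.
    rows = {
        "W": [50.0, 60.0, 90.0, 88.0],
        "X": [60.0, 50.0, 89.0, 91.0],
        "Y": [90.0, 89.0, 50.0, 65.0],
        "Z": [88.0, 91.0, 65.0, 50.0],
    }
    for a, b, dist in average_linkage_clustering(rows):
        print(a, b, round(dist, 2))
```

With these toy rows, the two confusable pairs (W with X, and Y with Z) merge
first and the two resulting branches are joined last.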


4.8 Receiver-Operator Characteristic 


An examination of the receiver-operator characteristic (ROC) of the system can help 
determine the reliability of the system’s scoring mechanism. The ROC reveals how
a system’s performance is affected as the required certainty of its top-choice score
is varied. The
ALI design in this thesis uses actual probability values to determine not only which 
language is the most likely candidate but also the certainty with which it believes 
its top choice to be correct. If the system is accurately determining the a posteriori 
language probability of its top-choice language candidate then the likelihood of the 
system’s top-choice being correct will indeed increase as the a posteriori probability of 
the top-choice language candidate increases. The ROC of a system demonstrates how 
the system’s performance is affected by the introduction of a rejection region. The 
ROC is calculated by setting a threshold on the system’s top-choice score and rejecting 
all utterances which fall below that threshold. The threshold is varied to examine the 
system’s performance as the rejection region is spanned from 0% rejection to 100% 
rejection. The standard ROC curve for the system is shown in Figure 4.14. The 
standard ROC curve shows the percentage of correctly identified utterances which 
are accepted (i.e., detection rate) against the percentage of incorrectly identified ut- 
terances which are accepted (i.e., false alarm rate) as the rejection region is varied. 
Because information about the absolute accuracy of the system is not readily appar- 
ent in the standard ROC curve, an alternate view of the ROC is shown in Figure 4.15. 
This figure shows how the system’s overall accuracy is affected as the rejection region 
is increased. As can be seen in both figures, the system’s performance does indeed 
improve as the utterances with the lowest scores are rejected. However, there is also 
plenty of room for improvement. 
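
The calculation described above can be sketched as follows: a threshold is swept
over the top-choice scores, and at each threshold the detection rate, the false
alarm rate, and the accuracy on the accepted utterances are recorded. The
function name and the example scores are hypothetical, and the sketch assumes
that at least one correct and one incorrect utterance are present.

```python
def roc_points(results):
    """Compute ROC points from top-choice scores.

    results is a list of (top_choice_score, is_correct) pairs, one per test
    utterance. Utterances whose score falls below the threshold are rejected.
    Returns a list of tuples:
        (threshold, detection_rate, false_alarm_rate, accuracy_on_accepted)
    with all rates expressed in percent.
    """
    n_correct = sum(1 for _, ok in results if ok)
    n_wrong = len(results) - n_correct
    points = []
    for threshold in sorted({score for score, _ in results}):
        accepted = [ok for score, ok in results if score >= threshold]
        hits = sum(1 for ok in accepted if ok)
        false_alarms = len(accepted) - hits
        points.append((threshold,
                       100.0 * hits / n_correct,
                       100.0 * false_alarms / n_wrong,
                       100.0 * hits / len(accepted)))
    return points


if __name__ == "__main__":
    # Hypothetical a posteriori probabilities of the top choice.
    results = [(0.9, True), (0.8, True), (0.6, False),
               (0.5, True), (0.3, False)]
    for point in roc_points(results):
        print(point)
```

The second and third entries of each tuple trace the standard ROC curve, while
the last entry corresponds to the alternate view of accuracy against rejection.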


¹In pairwise language identification, accuracies can be expected to range from a
minimum of 50% to a maximum of 100%. Thus, the more similar two languages are, the
more the system’s accuracy should move towards 50%. A value of 50 is therefore chosen
to fill the diagonal elements of the matrix, since no language is more similar to any
language than itself.

Figure 4.14: Standard ROC curve for the ALI system

Figure 4.15: System accuracy over a varying rejection region


4.9 Rank Order Statistics


The rank order statistic can be useful in determining the severity of the errors that are 
incurred by the system. When a system fails to identify the correct language with its 
top choice, it is hoped that the correct language is at least the second or third choice 
of the system. Overall, the rank order statistic for the system is 2.51. Figure 4.16 
shows how the system performs in the task of identifying the correct language of an 
utterance within the top n choices of its candidate list. As can be seen in the figure, 
the system identifies the correct language as one of its top three choices 76.3% of 
the time. Additionally, only 10.4% of the time is the correct language placed within 
the lower half of the candidate list. These results indicate that the system is able to 
provide a reliable list of alternative choices when its top choice is incorrect. 
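
The quantities discussed in this section can be computed from the ranked
candidate lists. The sketch below assumes that the rank order statistic is the
average rank position of the correct language, which is consistent with how it
is used here. The function names and the example ranks are hypothetical.

```python
def rank_of_correct(scores, correct):
    """Rank (1 = top choice) of the correct language in a scored candidate list.

    scores maps each candidate language to its score (higher is better).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(correct) + 1


def rank_order_statistic(ranks):
    """Average rank of the correct language over all test utterances."""
    return sum(ranks) / len(ranks)


def top_n_accuracy(ranks, n):
    """Percentage of utterances whose correct language is in the top n."""
    return 100.0 * sum(1 for r in ranks if r <= n) / len(ranks)


if __name__ == "__main__":
    # One hypothetical utterance with log scores for three candidates.
    scores = {"English": -10.0, "German": -9.0, "French": -12.0}
    print(rank_of_correct(scores, "English"))

    # Hypothetical ranks of the correct language for ten test utterances.
    ranks = [1, 1, 2, 3, 6, 1, 2, 1, 4, 1]
    print(rank_order_statistic(ranks))
    print(top_n_accuracy(ranks, 1), top_n_accuracy(ranks, 3))
```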

Figure 4.16: System accuracy in placing the correct language within the top n
candidates


Chapter 5


Conclusion 


5.1 Summary 


This thesis has attempted to achieve three goals. The first goal was to present a 
formal probabilistic framework describing the ALI problem. This framework, which 
uses the ideas of House and Neuburg as a foundation, is presented in Chapter 2. The 
second goal was to present a new segment-based approach for ALI. This approach, 
which gains its structure from the probabilistic framework discussed in Chapter 2, 
is presented in Chapter 3. The third goal was to analyze and understand the var- 
ious modeling decisions, assumptions, and test conditions which affect the system’s 
performance. These analyses are presented in Chapters 3 and 4. Based on the inves- 
tigation described in this thesis, we can draw several important, although tentative, 
conclusions. These are summarized below. 

The House and Neuburg study indicated that the phonotactic constraints of lan- 
guages are very strong and could prove extremely useful for ALI. The results of the 
experiments conducted for this thesis are supportive of this claim. The language 
model component of the ALI system, which was designed to capture the phonotactic 
constraints of the different languages, performed better than all of the other models 
combined. However, experiments also showed that House and Neuburg’s proposal 
to represent the phonetic sequence with broad phonetic classes instead of detailed 
phonetic elements did not yield optimal performance. The results presented in Chap- 
ter 3 indicated that increasing the detail of the phonetic elements used to represent 
the phonetic sequence of an utterance helped the language identification performance 
even with the presence of phonetic recognition errors. 

Despite the fact that the language model was the most dominant component of
the ALI system, this thesis showed that additional information, such as prosodic and
acoustic information, can also be useful for language identification. When the acoustic
and prosodic models were incorporated into the ALI system to support the language
model, the system’s accuracy increased from 41.7% to 48.6%. Additionally, despite
the simplicity of the modeling of the prosodic features, the prosodic model proved to
be more useful for language identification than the acoustic model.


5.2 Assessment of System Performance 


As discussed in Chapter 1, it is very difficult to compare an ALI system to the 
state of the art in ALI because there have been very few studies which utilize a 
comparable evaluation task. With the recent release of the OGI Multi-Language 
Telephone Speech Corpus into the public domain, there now exists a common data 
set from which meaningful comparisons of different ALI approaches can be made. 
NIST is currently coordinating a series of ALI evaluations utilizing the OGI corpus to
compare the approaches of eleven different research efforts.¹ To date, two studies have
published preliminary results using the OGI corpus. These studies were conducted
by Muthusamy and Cole [28] and by Zissman [36]. Table 5.1 shows how the results
reported in this thesis compare to the results of their systems.² As can be seen, the
system developed in this thesis is competitive with the other two systems.³


Authors of Study        Date               System Accuracy

Hazen                   August, 1993       48.6%
Muthusamy and Cole      September, 1992    47.7%
Zissman                 April, 1993        40.0%


Table 5.1: Summary of results using the OGI Multi-Language Telephone Speech 
Corpus 


The performance of the system is quite promising considering the difficulty of the 
task. The OGI corpus contains many features which can adversely affect the system’s 
performance. Some of the difficulties of the corpus include: 


• The data set is currently unlabeled, making fully supervised training impossible.
• The data set was collected over many different channels of varying qualities.
• The data set was sampled at a rate of only 8 kHz, limiting its bandwidth.
• A large portion of the data contains completely unconstrained speech.


¹NIST is the National Institute of Standards and Technology.
²Both groups are currently continuing their ALI research and improved results can be expected.
³The training and test sets extracted from the OGI corpus were identical for all three systems.


5.3 Future Work


5.3.1 System Improvements 


Though the system developed in this thesis has proven to be competitive with other 
current ALI systems, there are still many improvements that can be made. In par- 
ticular, future research will attempt to satisfy the following goals: 


• Improve the phonetic recognition component of the system.
• Investigate methods for channel normalization.
• Discover more useful segment-based features for acoustic modeling.
• Develop modeling schemes to capture the correlation between the F0 contour
  and the segment durations.
• Develop modeling schemes to better capture the dynamic characteristics of the
  F0 contour.
• Examine different approaches for system integration.


As shown in Chapter 3, the performance of the language modeling component of 
the system is very dependent on the quality of the representation of the underlying 
phonetic sequence. Therefore, it is important for the phonetic recognizer used by the 
ALI system to be as accurate as possible. Since neither of the phonetic recognizers 
used in this thesis was trained in a fully supervised fashion, large improvements in 
the phonetic recognition accuracy may not be possible until fully supervised training 
can be implemented. This may be feasible in the near future when the phonetic 
transcriptions of the OGI data become available. 

Because the OGI corpus was collected over the telephone lines using a different 
channel for every speaker, the acoustic qualities of the speech can vary significantly 
from speaker to speaker. Therefore, the ALI system should account for the acoustic 
differences between the channels in its modeling schemes to help avoid any channel 
dependencies which may arise. The ALI design in this thesis does not account for the 
channel differences. Therefore, future work will investigate methods, such as blind 
deconvolution, for channel normalization. 

The acoustic model used in this thesis attempts to model the acoustic information 
of the different phonemes in each language using segment-based feature vectors. The 
feature vectors that were used were relatively simple in nature; they contained the 
values of the MFCCs and delta MFCCs averaged over the length of a segment. The 
acoustic model may be improved by using a different set of features. In fact, the 
acoustic features that are useful for language identification may be quite different from 
the features that are useful for phonetic recognition. It has been shown that useful
segment-based acoustic measurements for phonetic recognition can be discovered in an 
automatic fashion [30]. It may be possible to automatically discover useful segment- 
based measurements for language identification in a similar fashion. Thus, future work 
will include attempting to discover more useful segment-based acoustic features. 
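
The segment-based feature vector described above, MFCCs and delta MFCCs averaged
over the length of a segment, can be illustrated with a short sketch. The
function name is hypothetical, and the delta computation used here (a first
difference between consecutive frames) is an assumption, since the thesis excerpt
does not specify how the deltas were computed.

```python
def segment_feature_vector(mfcc_frames):
    """Average MFCCs and delta MFCCs over one segment.

    mfcc_frames is a list of per-frame MFCC vectors (lists of floats) that
    fall within a single phonetic segment. Returns the mean MFCC vector
    followed by the mean delta vector.
    """
    n_frames = len(mfcc_frames)
    dim = len(mfcc_frames[0])

    # Assumed delta definition: difference between consecutive frames.
    deltas = [[cur[k] - prev[k] for k in range(dim)]
              for prev, cur in zip(mfcc_frames, mfcc_frames[1:])]
    if not deltas:
        # A one-frame segment has no frame-to-frame change.
        deltas = [[0.0] * dim]

    mean_mfcc = [sum(frame[k] for frame in mfcc_frames) / n_frames
                 for k in range(dim)]
    mean_delta = [sum(d[k] for d in deltas) / len(deltas)
                  for k in range(dim)]
    return mean_mfcc + mean_delta


if __name__ == "__main__":
    # Three hypothetical frames with two coefficients each.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
    print(segment_feature_vector(frames))
```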

The prosodic model used in this thesis attempts to capture the F0 and segment
duration information using simple statistical properties. The independence
assumptions that were made may in fact be hurting the performance of the prosodic
model. The first major assumption was to treat the segment durations and the F0
contour as independent entities. Because of the correlations that may exist between
the segment durations and the F0 contour in the creation of the stress or tone of a
segment, this assumption may be inappropriate. The second major assumption was to
treat each frame of the F0 contour as independent. This assumption eliminates almost
all of the dynamic information contained in the F0 contour. Thus, future work will
include efforts to create models which can account for the correlations between the
F0 contour and segment durations as well as the dynamic nature of the F0 contour
over time.

An additional assumption that was used in the prosodic model was that the
unvoiced frames of the utterance carried no useful information and could be ignored.
Although a preliminary experiment which incorporated the probability of voicing
parameter into the F0 model did not yield any improved performance in the F0 model,
this is an assumption which also requires further study.

As mentioned in Chapter 4, new procedures for integrating the different models 
into the complete system should be investigated. The static scaling method described 
in Chapter 3 does not account for the possibility that some models may contribute 
significantly more useful information than others as the length of an utterance is 
increased. Thus, methods for dynamically changing the scaling factors for each model 
as the length of an utterance increases should be investigated. 

The final system also did not use the same linguistic sequence C for each of 
the models. The language and acoustic models used a C which was represented 
with 59 phonetic classes while the segment duration model used a C which was 
represented with 29 phonetic classes. The fact that the models are not modeling 
the same probability space may be hurting the system. Thus, better methods need 
to be developed to find the single phonetic representation of C which optimizes the 
system’s performance. 


5.3.2 Incorporation into a Multi-Lingual System 


As mentioned in Chapter 1, an ALI system can be utilized as a component within a 
larger multi-lingual system. As a testbed for multi-lingual research, a multi-lingual 
information retrieval system is currently under development in the Spoken Language 
Systems group at MIT. This system, known as the multi-lingual VOYAGER system,
is designed to provide travel information for the city of Cambridge [38, 39, 42]. 
VOYAGER currently has the capability to understand queries in either English or 
Japanese [7], and is being ported to French, Italian and German. 

Within the multi-lingual VOYAGER domain, ALI can be performed as a two-step
process. The first step is to perform a fast match to provide an ordered list of
possible language candidates. The second step is to utilize the speech recognizer of 
the top choice language candidate to attempt to decipher the utterance. If the speech 
recognizer for the top-choice language fails to understand the utterance, the utterance 
is passed to the recognizer for the second choice, and so forth, until the system is able 
to understand the input query. In this scenario, the ALI design described in this 
thesis could be used to provide the language identification fast match. 
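
The two-step process described above can be sketched as a simple fallback loop
over the fast-match candidate list. The function name, the callable interfaces,
and the toy recognizers are hypothetical and do not correspond to the actual
VOYAGER components.

```python
def identify_and_understand(utterance, fast_match, recognizers):
    """Two-step language identification with recognizer fallback.

    fast_match(utterance) returns candidate languages ordered from most to
    least likely. recognizers maps a language to a callable that returns an
    interpretation of the utterance, or None if it cannot be understood.
    Returns (language, interpretation, attempts); language and interpretation
    are None if no recognizer succeeds.
    """
    attempts = 0
    for language in fast_match(utterance):
        attempts += 1
        interpretation = recognizers[language](utterance)
        if interpretation is not None:
            return language, interpretation, attempts
    return None, None, attempts


if __name__ == "__main__":
    # Toy stand-ins: the fast match ranks Japanese first, but only the
    # English recognizer is able to understand the query.
    def toy_fast_match(utterance):
        return ["Japanese", "English"]

    toy_recognizers = {
        "Japanese": lambda utterance: None,
        "English": lambda utterance: "parsed: " + utterance,
    }
    print(identify_and_understand("where is the nearest bank",
                                  toy_fast_match, toy_recognizers))
```

The number of attempts gives a rough measure of the cost incurred when the fast
match places the correct language lower in its list.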

When the ALI system is incorporated into a system such as multi-lingual
VOYAGER, the tradeoff between accuracy and efficiency is very important. Since higher
level knowledge of each language is available, the entire system should be able to
perform nearly flawless language identification for sentences within its domain. The goal
of the system is thus shifted from accuracy to speed. In this two-tiered approach to 
language identification, the optimal solution may involve creating an ALI fast match 
component which sacrifices accuracy for the sake of efficiency. 

Thus, future work will also be directed at incorporating the ALI design described 
in this thesis into the multi-lingual VOYAGER system. This will involve a careful 
study of the tradeoff between the design’s computational efficiency and its language 
identification accuracy. 

Appendix A


Families of OGI Languages 


Figure A.1 shows a tree describing the ten languages in the OGI corpus in terms 
of their linguistic origins [31]. It should be noted that the structure of the tree in 
Figure A.1 is derived from only one of many different hypotheses that linguists have 
proposed to describe the development of the different languages of the world. Fur- 
thermore, to date linguists have been unable to determine whether or not any of the 
approximately 30 primary language families of the world were developed from a single 
common source or whether these language families came into existence independently. 
Thus, it is only for aesthetic reasons that the structure of the tree in Figure A.1 is 
shown with the five primary language families originating from a common node. 

Indo-European
    Germanic
        West
            English
            German
    Italic
        Romance
            French
            Spanish
    Indo-Iranian
        Iranian
            Farsi
Altaic
    Japanese
    Korean
Dravidian
    South
        Tamil
Sino-Tibetan
    Sinitic
        North
            Mandarin
Austro-Asiatic
    Mon-Khmer
        Viet-Muong
            Vietnamese


Figure A.1: Language family tree of the 10 languages in the OGI corpus 

Appendix B


Phone Sets of OGI Languages 


Table B.1 displays the phones which are used in each of the ten languages used 
in this thesis. The phones are written using the standard International Phonetic 
Association (IPA) alphabet. The table is created from the language specific phonetic 
lists compiled by Ruhlen [31]. These lists include all of the primary realizations of 
the phonemes of the language. However, as Ruhlen states, the lists do not always 
contain context specific allophones. For example, Ruhlen does not list the flap [ɾ] as
a phone in American English because it is simply a context dependent allophone for 
the phoneme /t/. Ruhlen also does not include diphthongs in his lists (although they 
are included for American English). Despite the incompleteness of the lists, they still 
provide a general idea of which sounds can be expected in each of the ten languages 
in the OGI database. 

[Table B.1 body: phones in each language, grouped by phonetic class (vowels, stops,
fricatives); the table entries are illegible in the source]


Table B.1: Phone Sets of Languages in OGI Database